proposal: thing-model-catalog: identification of ThingModels #10

Open
alexbrdn opened this issue Oct 25, 2023 · 17 comments

@alexbrdn

alexbrdn commented Oct 25, 2023

[Description]

Identification of ThingModels

Thing Models need to be unambiguously identified so that they can be referenced both within the Thing Model Catalog (TMC) (e.g. in search results or when fetching from the TMC) and outside it (e.g. from TDs or other TMs in the field).

This proposal intends to describe how TMC is going to handle identification of TMs.

Note that the W3C TD standard does not provide guidance on this so far (see w3c/wot-thing-description#1905). Hence, whatever solution we implement might become incompatible with the standard later.

Context

Each TM within the catalog will include the manufacturer's name, the manufacturer part number, the author's name, and a version (together, the identifying fields). All of these except the version will have to be provided at import, and their presence will be enforced by the importing API.
The version may be included in the TM, or it may need to be generated automatically when importing a TM.

Requirements for identification system

  • different versions of TMs for the same device need to be unambiguously identifiable so that fetching a TM from TMC can be reliably reproduced
  • IDs should be valid relative IRI references
  • IDs should be human-readable and include manufacturer, part number, and author
  • A TMC may or may not be exposed over HTTP API. The IDs should be usable regardless.

Use cases to be considered

  • A user should be able to import a TM definition into TMC from a local file
  • The search function should present the user with a list of found TMs including their IDs. The user can directly copy/paste these IDs into a command to fetch the TMs.
  • A remote repository can be added to the TMC at any time, and TMs with different content should never share an ID. If the same ID appears in multiple remotes and the file contents are identical, there is no real conflict between them.

[How]

Proposed Identification

Each TM within the catalog is uniquely identified by the combination of identifying fields.
The id field of a TM is composed of these fields as follows:

id: [author_name "/"] manufacturer_name "/" mpn "/" ([ optional_path_part "/" ])* version ".tm.json"

author_name, manufacturer_name, and mpn must be present in the TM at the following paths, respectively: $/schema:author/name, $/schema:manufacturer/name, $/schema:mpn. These fields are defined by https://schema.org/author, https://schema.org/manufacturer, and https://schema.org/mpn. All three fields are sanitized for use as URI path segments by replacing each run of consecutive whitespace and special characters not allowed in base file names with "-".

When author_name and manufacturer_name are the same, the author_name is omitted from the id.

Optional path parts may be added by the author when importing to TMC.

This id schema can be closely followed by the storage schema. It lets a contributor define her own hierarchy for TMs and any additional files that may be provided along with the TMs.
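
As an illustration of the id composition described above, here is a minimal sketch in Python. The function names `sanitize` and `build_tm_id` are hypothetical, and the set of characters treated as "allowed in base file names" (ASCII letters, digits, ".", "_" and "-") is an assumption, since the proposal does not enumerate them.

```python
import re


def sanitize(part: str) -> str:
    """Replace each run of whitespace or disallowed characters with a single '-'."""
    # Assumption: only ASCII letters, digits, '.', '_' and '-' are kept as-is.
    return re.sub(r"[^A-Za-z0-9._-]+", "-", part.strip())


def build_tm_id(tm: dict, version: str, optional_path_parts=()) -> str:
    """Compose [author/]manufacturer/mpn/[optional/...]version.tm.json."""
    author = sanitize(tm["schema:author"]["name"])
    manufacturer = sanitize(tm["schema:manufacturer"]["name"])
    mpn = sanitize(tm["schema:mpn"])

    parts = []
    # The author is omitted when it equals the manufacturer.
    if author != manufacturer:
        parts.append(author)
    parts += [manufacturer, mpn]
    parts += [sanitize(p) for p in optional_path_parts]
    parts.append(version + ".tm.json")
    return "/".join(parts)
```

Whether names should also be lower-cased is not stated in the proposal, so the sketch leaves the case untouched.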

Handling of the version field

The version field of the id closely follows the format of Golang modules' pseudo-version numbers. It has the following format:

version: base_version "-" timestamp "-" content_hash
  • base_version (vX.Y.Z) is derived from the value of the $/version/model field if that field is present and can be parsed as semver; otherwise it is set to "v0.0.0".
  • timestamp (yymmddhhmmss) is the UTC time at which the file was uploaded to TMC.
  • content_hash is the 12-character prefix of the SHA-1 hash of the TM file contents. The hash is calculated after first removing the $/id field from the file, so that contents can be compared independently of the id, which contains a timestamp.
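
The three parts of the version could be derived roughly as follows. This is a sketch under stated assumptions, not the TMC implementation: the function names are hypothetical, only plain X.Y.Z values (with an optional leading "v") are accepted as semver, and the hash is taken over a canonical JSON re-serialization because the proposal does not say how the bytes are normalized after removing the id. The timestamp layout follows the yymmddhhmmss bullet literally; the example ID elsewhere in this thread shows a four-digit year, so the layout may need to be "%Y%m%d%H%M%S" instead.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Assumption: only plain MAJOR.MINOR.PATCH is recognized, no pre-release or build tags.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def base_version(tm: dict) -> str:
    """Return vX.Y.Z from $/version/model, or v0.0.0 if absent or unparsable."""
    version = tm.get("version")
    model = version.get("model") if isinstance(version, dict) else None
    match = SEMVER.match(model) if isinstance(model, str) else None
    if match is None:
        return "v0.0.0"
    return "v{}.{}.{}".format(*match.groups())


def content_hash(tm: dict) -> str:
    """12-character SHA-1 prefix of the TM with its top-level id removed."""
    without_id = {k: v for k, v in tm.items() if k != "id"}
    # Assumption: sorted keys and compact separators stand in for "file contents".
    canonical = json.dumps(without_id, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]


def make_version(tm: dict, uploaded_at: datetime) -> str:
    """Compose base_version-timestamp-content_hash; pass a timezone-aware datetime."""
    stamp = uploaded_at.astimezone(timezone.utc).strftime("%y%m%d%H%M%S")
    return f"{base_version(tm)}-{stamp}-{content_hash(tm)}"
```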

Details on implementing specific use cases

Importing a TM

Importing presents two distinct sub-cases with regard to identifying the TM after it's been imported:

  1. The TM being imported has no id

Generate an id as described above.

  2. The TM already has an id.

If the id does not conform to the TM id schema described above, move the original id to a link relation of type "original". Generate an id as per TM id schema. Do the same if the id does conform to this schema, but identifying fields in the id are not equal to those in the TM.

If the id does conform to the TM id schema, generate the new version value.
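
The two sub-cases can be told apart with a small check before the id is (re)generated. The sketch below is only an illustration: `classify_id` and `move_id_to_original_link` are hypothetical names, the names in the TM are assumed to be already sanitized, and the regular expression accepts both 12-digit and 14-digit timestamps because the exact layout is an assumption.

```python
import re

# Assumption: the last path segment has the form vX.Y.Z-<timestamp>-<12 hex>.tm.json.
VERSION_FILE = re.compile(r"^v\d+\.\d+\.\d+-\d{12,14}-[0-9a-f]{12}\.tm\.json$")


def expected_prefix(tm: dict) -> str:
    """Path prefix implied by the identifying fields of the TM itself."""
    author = tm["schema:author"]["name"]
    manufacturer = tm["schema:manufacturer"]["name"]
    mpn = tm["schema:mpn"]
    head = [manufacturer, mpn] if author == manufacturer else [author, manufacturer, mpn]
    return "/".join(head) + "/"


def classify_id(tm: dict) -> str:
    """Return 'missing', 'conforming' or 'foreign' for the id found in the TM."""
    tm_id = tm.get("id")
    if tm_id is None:
        return "missing"
    file_name = tm_id.rsplit("/", 1)[-1]
    if tm_id.startswith(expected_prefix(tm)) and VERSION_FILE.match(file_name):
        return "conforming"
    # Wrong shape, or identifying fields that differ from those in the TM.
    return "foreign"


def move_id_to_original_link(tm: dict) -> dict:
    """Return a copy without the id, keeping it as a link of relation type 'original'."""
    result = {k: v for k, v in tm.items() if k != "id"}
    result["links"] = list(tm.get("links", [])) + [{"rel": "original", "href": tm["id"]}]
    return result
```

In both the "missing" and the "foreign" case a fresh id would then be generated; in the "conforming" case only the version part is regenerated.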

After the id of the TM being imported has been determined, compare it with existing TMs. For this purpose, versions with the same base version and the same content hash are considered equal, irrespective of the timestamp value. The TMC CLI/API may abort the import if the TMC already contains a TM whose contents are equivalent apart from the version timestamp.
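
The equivalence rule for versions can be sketched as a comparison that ignores the timestamp part. The helper names are hypothetical, and the pattern assumes a 12- to 14-digit timestamp.

```python
import re

# base_version, timestamp, content_hash
VERSION = re.compile(r"^(v\d+\.\d+\.\d+)-(\d{12,14})-([0-9a-f]{12})$")


def split_version(version: str):
    """Split a version string into (base_version, timestamp, content_hash)."""
    match = VERSION.match(version)
    if match is None:
        raise ValueError(f"not a TMC version: {version!r}")
    return match.groups()


def same_content(version_a: str, version_b: str) -> bool:
    """True if both versions share base version and content hash; timestamps are ignored."""
    base_a, _, hash_a = split_version(version_a)
    base_b, _, hash_b = split_version(version_b)
    return (base_a, hash_a) == (base_b, hash_b)
```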

Identifying a TM in search results

Search results will print out the name of the remote where the result has been found and the id, including the version.

Fetching a TM

The id with version as returned by the search should be enough to fetch a TM.
A TM can also be fetched by an incomplete ID, in which the version field is either omitted (referring to the latest version) or contains only the base semantic version (referring to the latest of the versions with that base version).
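
Resolving an incomplete ID to a concrete version might look like the sketch below. `resolve_version` is a hypothetical helper, and the ordering is an assumption: "latest" is taken to mean the highest base version first and the most recent timestamp second, with all timestamps assumed to use the same number of digits.

```python
import re

VERSION = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-(\d{12,14})-([0-9a-f]{12})$")


def sort_key(version: str):
    """Order by numeric semver parts, then by upload timestamp."""
    match = VERSION.match(version)
    if match is None:
        raise ValueError(f"not a TMC version: {version!r}")
    major, minor, patch, stamp, _ = match.groups()
    return (int(major), int(minor), int(patch), stamp)


def resolve_version(available, requested=None) -> str:
    """Pick a full version for a full, base-only, or omitted version request."""
    if requested in available:
        return requested  # already a complete version
    candidates = [
        v for v in available
        if requested is None or v.startswith(requested + "-")
    ]
    if not candidates:
        raise LookupError(f"no version matching {requested!r}")
    return max(candidates, key=sort_key)
```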

Referencing a TM

It is up to the Consumer to refer to a TM by its id as provided in the TM file (i.e. relative IRI), or by resolved absolute IRI.

Privacy Considerations

With regard to the Privacy Considerations outlined in the WoT TD standard (https://www.w3.org/TR/wot-thing-description11/#sec-privacy-consideration), a Consumer working in a privacy-sensitive context SHOULD NOT include the link to the TM in generated TDs. This avoids leaking potentially private information contained in a TM's id.

[Documentation]

<--- OPTIONAL: if you feel it is needed, provide a related documentation as described in the README.md of the repository --->

cc => @hadjian @andrisciu

@egekorkan
Member

Open_question: how can the version be generated, if it is to follow the standard's recommendation of SEMVER format? Generating random versions (e.g. uuid) prevents ordering of versions and automatically determining which version is the latest.

There is planned work on this in the standardization. If you have any input on how this should happen, we will happily take it. My initial idea is to use the same concepts as in API versioning: removing an affordance or changing its data schema is a breaking change; adding an affordance, description, or title is a new feature; and fixing a typo in a description/title is a patch. If this is standardized, it will very probably be part of the discovery spec, which will also contain TMs, which means that TM directories may check for version changes (it is a bit of a stretch but a realistic one). E.g. if there is a breaking change but the major version is not incremented, the TM is rejected.
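
A rough sketch of how such a check could classify the difference between two TM revisions is shown below. Nothing here is standardized: `required_bump` is a hypothetical helper, only the top-level title/description keys of an affordance are treated as text, and every remaining difference (including a newly added title or description, which the idea above would count as a feature) is lumped into "patch" for brevity.

```python
AFFORDANCE_KEYS = ("properties", "actions", "events")
TEXT_KEYS = {"title", "titles", "description", "descriptions"}


def without_text(definition: dict) -> dict:
    """Drop human-readable text so only the machine-relevant part is compared."""
    return {k: v for k, v in definition.items() if k not in TEXT_KEYS}


def required_bump(old: dict, new: dict) -> str:
    """Return 'major', 'minor', 'patch' or 'none' for the change from old to new."""
    bump = "none"
    for kind in AFFORDANCE_KEYS:
        old_aff = old.get(kind, {})
        new_aff = new.get(kind, {})
        for name, definition in old_aff.items():
            if name not in new_aff:
                return "major"  # affordance removed
            if without_text(definition) != without_text(new_aff[name]):
                return "major"  # data schema or other non-text part changed
        if set(new_aff) - set(old_aff):
            bump = "minor"  # affordance added
    if bump == "none" and old != new:
        bump = "patch"  # only text or other metadata differs
    return bump
```

A directory could then reject an upload whose declared version increase is smaller than the bump returned here.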

@alexbrdn
Author

EDIT: restructured text, outlined use cases relevant for TM identification considerations

@alexbrdn
Author

alexbrdn commented Oct 31, 2023

EDIT: added proposal on how to handle versions of TMs in IDs

@a-hennig

a-hennig commented Nov 6, 2023

I think we cannot assume that the TM comes from the manufacturer him/herself. That's probably why the author_name is proposed as the first element in the path. To avoid accidental confusion between the "authoring responsible entity" and the "person writing it", we could rename it to origin_name or similar.

@a-hennig

a-hennig commented Nov 6, 2023

If I see an instantiated TD, I want to be able to check its authenticity ... so I don't think leaving out the reference to the TM is a good thing (and I didn't get how it helps with privacy). Leaving the source in might also be needed for copyright / author's acknowledgement (of the entity authoring it, not the person unless so chosen).

We also need a way to verify integrity, i.e. that it hasn't been manipulated.

@daHaimi

daHaimi commented Nov 6, 2023

I would opt for identification and URL-resolution by some standard URI like purl (e.g. https://github.com/package-url/purl-spec)

This would allow the URI to identify the thing as a TM and define where it can be found/parsed to a URL or local path, e.g.:

This also requires semantic versioning and allows for addressing "standards" and "alternatives"

@hadjian
Contributor

hadjian commented Nov 6, 2023

I think we cannot assume that the TM comes from the manufacturer him/herself. That's probably why the author_name is proposed as the first element in the path. To avoid accidental confusion between the "authoring responsible entity" and the "person writing it", we could rename it to origin_name or similar.

Yep, we noticed the confusion when talking about it. Should name it "authority" or something.

@hadjian
Contributor

hadjian commented Nov 6, 2023

If I see an instantiated TD, I want to be able to check its authenticity ... so I don't think leaving out the reference to the TM is a good thing (and I didn't get how it helps with privacy). Leaving the source in might also be needed for copyright / author's acknowledgement (of the entity authoring it, not the person unless so chosen).

We also need a way to verify integrity, i.e. that it hasn't been manipulated.

Agree. Also it addresses consumers in "privacy sensitive" environments. Doesn't have an impact on this proposal. @alexbrdn why did you include the paragraph?

@hadjian
Contributor

hadjian commented Nov 8, 2023

I think we cannot assume that the TM comes from the manufacturer him/herself. That's probably why the author_name is proposed as the first element in the path. To avoid accidental confusion between the "authoring responsible entity" and the "person writing it", we could rename it to origin_name or similar.

Agree, but we are using terms from schema.org and there is no authority or origin_name there. Actually, author can have an organization or person value, so per schema.org it is correct. Maybe some other terms from CreativeWork are more clear. https://schema.org/CreativeWork:

  • schema:publisher
  • schema:producer
  • schema:provider
  • schema:creator (same as author)

@a-hennig @alexbrdn what do you think?

@hadjian
Contributor

hadjian commented Nov 8, 2023

@alexbrdn if the original id is actually a templated id, like currently suggested in the standard, do we also move it to a link relation?

@a-hennig

a-hennig commented Nov 9, 2023 via email

@a-hennig

a-hennig commented Nov 9, 2023 via email

@alexbrdn
Author

If I see an instantiated TD, I want to be able to check its authenticity ... so I don't think leaving out the reference to the TM is a good thing (and I didn't get how it helps with privacy). Leaving the source in might also be needed for copyright / author's acknowledgement (of the entity authoring it, not the person unless so chosen).

We also need a way to verify integrity, i.e. that it hasn't been manipulated.

Including or leaving out the reference to the TM in TD is up to the producer of TD. I have made that remark to raise awareness that the id may include potentially privacy-relevant information, as noted in the standard.

Integrity verification is out of scope for this proposal and, in my opinion, is in no way hampered by the proposed identification scheme.

@alexbrdn
Author

Maybe some other terms from CreativeWork are more clear. https://schema.org/CreativeWork:

  • schema:publisher
  • schema:producer
  • schema:provider
  • schema:creator (same as author)

@a-hennig @alexbrdn what do you think?

I don't see much difference between producer and author for our purposes. Provider and publisher seem less fitting. I'd stick with author (or creator)

@alexbrdn
Author

@alexbrdn if the original id is actually a templated id, like currently suggested in the standard, do we also move it to a link relation?

In case it has the correct mandatory fields for the file being imported (same author, etc.), then no. Otherwise, yes.

@alexbrdn
Author

EDIT: updated proposal in light of the in-person discussion on Nov 6th. The largest change is that the version now builds the file name.

N.B.: as it currently stands, an ID cannot, in the general case, be unambiguously parsed without knowing whether the TM is "official", i.e. authored by the manufacturer, or not.
For example, "omnicorp/senseall/temperature/v1.0.0-20231115134253-abcdef012345.tm.json" may mean

author_name = "omnicorp"
manufacturer_name = "senseall"
mpn = "temperature"
optional_path_parts = ""

or it may mean

author_name = "omnicorp"
manufacturer_name = "omnicorp"
mpn = "senseall"
optional_path_parts = "temperature"
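
To make the ambiguity concrete, here is a minimal sketch that lists every reading of an ID that the schema permits. `candidate_parses` is a hypothetical helper, not part of TMC.

```python
def candidate_parses(tm_id: str):
    """Return all interpretations of an ID that the id schema permits."""
    *path, _file_name = tm_id.split("/")
    parses = []
    if len(path) >= 3:
        # Reading 1: an explicit author precedes the manufacturer.
        parses.append({
            "author_name": path[0],
            "manufacturer_name": path[1],
            "mpn": path[2],
            "optional_path_parts": path[3:],
        })
    if len(path) >= 2:
        # Reading 2: the author was omitted because it equals the manufacturer.
        parses.append({
            "author_name": path[0],
            "manufacturer_name": path[0],
            "mpn": path[1],
            "optional_path_parts": path[2:],
        })
    return parses
```

For the ID above the function returns both readings, so a parser needs extra information (for example, the identifying fields inside the TM itself) to choose between them.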

Opinions on this change are very much welcome

@alexbrdn alexbrdn added Incoming and removed Draft labels Nov 14, 2023
@alexbrdn
Author

@daHaimi
#10 (comment)

purl seems to be rather specific for software packages. I do not see how it can be bent to cover all our requirements.
