Merge pull request #296 from ResearchObject/referencing-crates
How to reference and retrieve another RO-Crate
stain authored Aug 29, 2024
2 parents 9f9bd85 + fedb143 commit d92d9f1
Showing 3 changed files with 149 additions and 16 deletions.
148 changes: 146 additions & 2 deletions docs/_specification/1.2-DRAFT/data-entities.md
@@ -396,7 +396,98 @@ Note that if a local file is intended to be packaged within an _Attached RO-Crat

### Directories on the web; dataset distributions

A _Directory File Entry_ or [Dataset] identifier expressed as an absolute URL on the web can be harder to download than a [File] because it consists of multiple resources. It is RECOMMENDED that such directories have a complete listing of their content in [hasPart], enabling download traversal.
A _Directory File Entry_ or [Dataset] identifier expressed as an absolute URL on the web can be harder to download than a [File] because it consists of multiple resources. It is RECOMMENDED that such directories have a complete listing of their content in [hasPart], enabling download traversal, or are themselves RO-Crates.

#### Referencing other RO-Crates

A referenced RO-Crate is also a [Dataset] data entity, but its [hasPart] entries do not need to be listed. Instead, its content and further metadata are available from its own RO-Crate Metadata Document, which may be retrieved directly or packaged within an archive. The referenced RO-Crate entity SHOULD have `conformsTo` pointing to the generic RO-Crate profile using the fixed URI `https://w3id.org/ro/crate`.

This section defines how a _referencing_ RO-Crate ("A") can declare data entities within A's RO-Crate Metadata Document, in order to indicate a _referenced_ RO-Crate ("B"). There are different options on how to find the identifier to assign the referenced crate in A, and how a consumer of A finding such a reference can find the corresponding RO-Crate Metadata Document for B.

##### Referencing RO-Crates that have a persistent identifier

If the referenced RO-Crate B has an `identifier` declared as B's [Root Data Entity identifier](root-data-entity#root-data-entity-identifier), then this is a _persistent identifier_ which SHOULD be used as the URI in the `@id` of the corresponding entity in RO-Crate A. For instance, if crate B had declared the identifier `https://pid.example.com/another-crate/` then crate A can reference B as an entity:

```json
{
"@id": "https://pid.example.com/another-crate/",
"@type": "Dataset",
"conformsTo": { "@id": "https://w3id.org/ro/crate" }
}
```

{.tip }
> The `conformsTo` generic RO-Crate profile on a `Dataset` entity MUST be version-less. The referenced crate B is NOT required to conform to the same version of the RO-Crate specification as A's RO-Crate Metadata Document.

{.warning }
> It is NOT RECOMMENDED to declare the generic profile `https://w3id.org/ro/crate` on a referencing crate A's own [root data entity](root-data-entity.html#direct-properties-of-the-root-data-entity), see [metadata descriptor](root-data-entity.html#ro-crate-metadata-descriptor).

Consumers that find a reference to a `Dataset` with the generic RO-Crate profile indicated MAY attempt to resolve the persistent identifier, but SHOULD NOT assume that the `@id` directly resolves to an RO-Crate Metadata Document. See section [Retrieving an RO-Crate](#retrieving-an-ro-crate) below for the recommended algorithm.

If an `identifier` is not declared in a referenced RO-Crate B, but the determined absolute URI has [Signposting] declared for a `Link:` with `rel=cite-as`, then that link MAY be considered as an equivalent permalink for B.


##### Determining entity identifier for a referenced RO-Crate

If the referenced RO-Crate B does not declare a resolvable `identifier`, additional steps are needed to find the correct `@id` to use:

1. If RO-Crate A is an [attached](structure.html#attached-ro-crate) crate, and RO-Crate B is a nested folder (e.g. `another-crate/`), then B SHOULD be treated as an attached crate (e.g. it has `another-crate/ro-crate-metadata.json`) and the relative path (`another-crate/`) used directly as `@id` as a [Directory File Entity](#directory-file-entity) within crate A.
2. If B's root data entity has an `@id` that is an absolute URI indicating a [detached crate](structure.html#detached-ro-crate), and that URI resolves according to [Retrieving an RO-Crate](#retrieving-an-ro-crate), then that can be used as the `@id` of the `Dataset` entity in A, equivalent to the `identifier` case above. However, as that URI was not declared as a persistent identifier, the timestamp property [sdDatePublished] SHOULD be included to indicate when the absolute URL was accessed.
3. If B's RO-Crate Metadata Document was located on the Web, but uses a relative URI reference for its root data entity (`./`), then its absolute URI can be determined from the [RFC3986] algorithm for [establishing a base URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5). For example, if root `{"@id": "./" }` is in metadata document `http://example.com/another-crate/ro-crate-metadata.json`, then the absolute URI for the `Dataset` entity is `http://example.com/another-crate/` (with the trailing `/`). If that URI is resolvable as in point 2, it can be used as equivalent `@id`. It is NOT RECOMMENDED to resolve a relative root identifier if the metadata document was retrieved from a URI that does not end with `/ro-crate-metadata.json` or `/ro-crate-metadata.jsonld` -- these are not part of a valid [attached](structure.html#attached-ro-crate) or [detached](structure.html#detached-ro-crate) RO-Crate.
4. If the RO-Crate is not on the Web, and does not have a persistent identifier, e.g. is within a ZIP file or local file system, then a non-resolvable identifier could be established. See appendix [Establishing a base URI inside a ZIP file](appendix/relative-uris.html#establishing-a-base-uri-inside-a-zip-file), e.g. `arcp://uuid,b7749d0b-0e47-5fc4-999d-f154abe68065/` if using a randomly generated UUID. This method may also be used if the above steps fail for an RO-Crate Metadata Document that is on the Web. In this case, the referenced RO-Crate entity MUST either declare a [referenced metadata document](#referencing-another-metadata-document) or [distribution](#downloadable-dataset).

If the RO-Crate Metadata Document is not available as a web resource, but only within an archive (e.g. ZIP), then instead reference it as a [Downloadable dataset](#downloadable-dataset).
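
The step above that resolves a relative root identifier (`./`) against the location of the metadata document can be sketched in Python. This is only an illustration: it relies on the standard library `urllib.parse.urljoin` for RFC 3986 reference resolution, and the function name `absolute_root_id` and the constant `VALID_SUFFIXES` are hypothetical names, not part of the specification.

```python
from urllib.parse import urljoin

# File names the text above accepts at the end of a metadata document URI.
VALID_SUFFIXES = ("/ro-crate-metadata.json", "/ro-crate-metadata.jsonld")


def absolute_root_id(metadata_url, root_id="./"):
    """Return the absolute URI of the root data entity, or None.

    None is returned when the metadata document URI does not end with one
    of the expected file names, since resolving a relative root identifier
    is NOT RECOMMENDED in that case.
    """
    # Ignore any query string or fragment when checking the file name.
    path = metadata_url.split("?", 1)[0].split("#", 1)[0]
    if not path.endswith(VALID_SUFFIXES):
        return None
    # urljoin resolves the relative reference against the base URI.
    return urljoin(metadata_url, root_id)
```

For the example in the text, `absolute_root_id("http://example.com/another-crate/ro-crate-metadata.json")` gives `http://example.com/another-crate/`, including the trailing `/`. Whether that URI can then be used as `@id` still depends on it being resolvable as described above.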

##### Referencing another metadata document

If a referenced RO-Crate Metadata Document is known at a given URI or path, but its corresponding RO-Crate identifier can't be determined as above (e.g. [Retrieving an RO-Crate](#retrieving-an-ro-crate) fails or requires heuristics), then a referenced metadata descriptor entity SHOULD be added. For instance, if `http://example.com/another-crate/ro-crate-metadata.json` resolves to an RO-Crate Metadata Document describing root `./`, but `http://example.com/another-crate/` always returns an HTML page without [Signposting] to the metadata document, then `subjectOf` SHOULD be added to an explicit metadata descriptor entity, which has `encodingFormat` declared for JSON-LD:

```json
{
"@id": "http://example.com/another-crate/",
"@type": "Dataset",
"conformsTo": { "@id": "https://w3id.org/ro/crate" },
"subjectOf": { "@id": "http://example.com/another-crate/ro-crate-metadata.json" }
},
{
"@id": "http://example.com/another-crate/ro-crate-metadata.json",
"@type": "CreativeWork",
"encodingFormat": "application/ld+json",
"sdDatePublished": "2024-08-22T23:57:03+01:00"
}
```

{.tip }
> Counter to [file format profile](data-entities.html#file-format-profiles) recommendations, the referenced RO-Crate metadata descriptor SHOULD NOT include its own `conformsTo` declarations to `https://w3id.org/ro/crate` or reference the dataset with `about`; this is to avoid confusion with the referencing RO-Crate's own [metadata descriptor](root-data-entity#ro-crate-metadata-descriptor).
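
From the consumer side, following such a `subjectOf` link could look like the Python sketch below. It assumes the `@graph` of crate A has already been parsed into a list of dictionaries; the function name `metadata_document_id` is hypothetical, and the sketch only looks for a `subjectOf` target whose `encodingFormat` includes `application/ld+json`.

```python
JSON_LD = "application/ld+json"


def metadata_document_id(crate_ref, graph):
    """Return the @id of the referenced metadata document, or None.

    crate_ref is the Dataset entity for the referenced crate; graph is the
    list of entities from the referencing crate's @graph.
    """
    by_id = {entity["@id"]: entity for entity in graph if "@id" in entity}
    subject = crate_ref.get("subjectOf")
    # subjectOf may be a single reference or an array of references.
    refs = subject if isinstance(subject, list) else [subject]
    for ref in refs:
        if not isinstance(ref, dict):
            continue
        target = by_id.get(ref.get("@id"), {})
        formats = target.get("encodingFormat")
        formats = formats if isinstance(formats, list) else [formats]
        if JSON_LD in formats:
            return target["@id"]
    return None
```

Applied to the two entities in the example above, this returns `http://example.com/another-crate/ro-crate-metadata.json`.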

##### Profiles of referenced crates

If the referenced crate conforms to a given [RO-Crate profile](profiles), this MAY be indicated by expanding `conformsTo` on the `Dataset` to an array to reference the profile as a contextual entity:

```json
{
"@id": "https://doi.org/10.48546/workflowhub.workflow.26.1",
"@type": "Dataset",
"conformsTo": [
{ "@id": "https://w3id.org/ro/crate" },
{ "@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"}
]
},
{ "@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0",
"@type": ["CreativeWork", "Profile"],
"name": "Workflow RO-Crate Profile",
"version": "1.0"
}
```

{.note}
> The profile declaration of a referenced crate is a hint. Consumers should check `conformsTo` as declared in the retrieved RO-Crate, as it may have been updated after this RO-Crate was created.
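
Because `conformsTo` may be either a single object or an array, a consumer needs to handle both shapes when looking for references to other crates. A minimal Python sketch follows, with hypothetical helper names (`as_list`, `is_crate_reference`, `hinted_profiles`) and assuming entities are plain dictionaries parsed from JSON-LD:

```python
RO_CRATE_PROFILE = "https://w3id.org/ro/crate"


def as_list(value):
    """Normalise a JSON-LD value that may be absent, single or an array."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def conforms_to_ids(entity):
    """Collect the @id of every conformsTo reference on an entity."""
    return [ref["@id"] for ref in as_list(entity.get("conformsTo"))
            if isinstance(ref, dict) and "@id" in ref]


def is_crate_reference(entity):
    """True if the entity is a Dataset declaring the generic RO-Crate profile."""
    return ("Dataset" in as_list(entity.get("@type"))
            and RO_CRATE_PROFILE in conforms_to_ids(entity))


def hinted_profiles(entity):
    """Profiles declared besides the generic one; treat these as hints only."""
    return [pid for pid in conforms_to_ids(entity) if pid != RO_CRATE_PROFILE]
```

For the WorkflowHub example above, `is_crate_reference` is true and `hinted_profiles` returns only the Workflow RO-Crate profile URI. The profile declared in the retrieved crate itself remains the authoritative value.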


#### Downloadable dataset


Alternatively, a common mechanism to provide downloads of a reasonably sized directory is as an archive file in formats such as [`application/zip`](https://www.nationalarchives.gov.uk/PRONOM/x-fmt/263) or [`application/gzip`](https://www.nationalarchives.gov.uk/PRONOM/x-fmt/266), described as a [DataDownload].

@@ -416,9 +507,62 @@ Alternatively, a common mechanism to provide downloads of a reasonably sized dir
}
```

Similarly, the _RO-Crate root_ entity may also provide a [distribution] URL, in which case the download SHOULD be an archive that contains the _RO-Crate Metadata Document_.
Similarly, the _RO-Crate root_ entity (or a reference to another RO-Crate as a `Dataset`) may provide a [distribution] URL, in which case the download SHOULD be an archive that contains the _RO-Crate Metadata Document_ (either directly in the archive's root, or within a single folder in the archive), indicated by a version-less `conformsTo`:

```json
{
"@id": "./",
"@type": "Dataset",
"identifier": "https://doi.org/10.48546/workflowhub.workflow.775.1",
"name": "Research Object Crate for Jupyter Notebook Molecular Structure Checking",
"distribution": {"@id": "https://workflowhub.eu/workflows/775/ro_crate?version=1"},
"…": ""
},
{
"@id": "https://workflowhub.eu/workflows/775/ro_crate?version=1",
"@type": "DataDownload",
"encodingFormat": ["application/zip", {"@id": "https://www.nationalarchives.gov.uk/PRONOM/x-fmt/263"}],
"conformsTo": { "@id": "https://w3id.org/ro/crate" }
}
```

In all cases, consumers should be aware that a `DataDownload` is a snapshot that may not reflect the current state of the `Dataset` or RO-Crate.


#### Retrieving an RO-Crate

To resolve a reference to an RO-Crate where `subjectOf` or `distribution` is unknown (e.g. an RO-Crate is cited from a journal article), the approach below is recommended to retrieve its [RO-Crate Metadata Document](structure#ro-crate-metadata-document-ro-crate-metadatajson):

1. Assuming the URI is a permalink, after following HTTP redirects without content negotiation, try [Signposting] to look for `Link` headers that reference `Link rel="describedby"` for an _RO-Crate Metadata Document_, or `Link rel="item"` for a distribution archive -- in either case prefer a link with `profile="https://w3id.org/ro/crate"` declared. For example, signposting for `https://doi.org/10.48546/workflowhub.workflow.120.5` leads to the archive `https://workflowhub.eu/workflows/120/ro_crate?version=5` as:

```
curl --location --head https://doi.org/10.48546/workflowhub.workflow.120.5
HTTP/2 302
Location: https://workflowhub.eu/workflows/120?version=5
HTTP/2 200
Content-Type: text/html; charset=UTF-8
Link: <https://workflowhub.eu/workflows/120/ro_crate?version=5> ;
rel="item" ; type="application/zip" ;
profile="https://w3id.org/ro/crate"
```

2. [HTTP Content-negotiation] for the [RO-Crate media type](appendix/jsonld#ro-crate-json-ld-media-type), for example:

Requesting `https://w3id.org/workflowhub/workflow-ro-crate/1.0` with HTTP header

`Accept: application/ld+json;profile=https://w3id.org/ro/crate` redirects to the _RO-Crate Metadata file_
`https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json`
3. The above approaches may fail or return an HTML page, e.g. for content-delivery networks that do not support content-negotiation.
4. An optional heuristic fallback is to try resolving the path `./ro-crate-metadata.json` from the _resolved_ URI (after permalink redirects). For example:
If permalink `https://w3id.org/workflowhub/workflow-ro-crate/1.0` redirects to `https://about.workflowhub.eu/Workflow-RO-Crate/1.0/index.html` (an HTML page), then
try retrieving `https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json`.
5. If the retrieved resource is a ZIP file (`Content-Type: application/zip`), then extract `ro-crate-metadata.json`, or, if the archive root only contains a single folder (e.g. `folder1/`), extract `folder1/ro-crate-metadata.json`.
6. If the retrieved resource is a [BagIt archive](appendix/implementation-notes#combining-with-other-packaging-schemes), e.g. containing a single folder `folder1` with `folder1/bagit.txt`, then extract and verify BagIt checksums before returning the bag's `data/ro-crate-metadata.json`.
7. If the returned/extracted document is valid JSON and has a [root data entity](root-data-entity#finding-the-root-data-entity), this is the RO-Crate Metadata File.
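
The ZIP handling in the list above (metadata file in the archive root, or within a single top-level folder) could be sketched in Python with the standard `zipfile` module. The function name `read_metadata_from_zip` is hypothetical; the sketch deliberately leaves out BagIt checksum verification and does not check for a root data entity, both of which a real consumer would still need.

```python
import io
import json
import zipfile

METADATA_NAME = "ro-crate-metadata.json"


def read_metadata_from_zip(data):
    """Return the parsed metadata document from ZIP bytes, or None."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        if METADATA_NAME in names:
            # Metadata file sits directly in the archive root.
            member = METADATA_NAME
        else:
            # Otherwise accept exactly one top-level folder.
            top_level = {name.split("/", 1)[0] for name in names}
            if len(top_level) != 1:
                return None
            member = f"{top_level.pop()}/{METADATA_NAME}"
            if member not in names:
                return None
        # json.loads raises ValueError if the document is not valid JSON.
        return json.loads(archive.read(member))
```

Returning `None` for archives with several top-level entries is a design choice of this sketch: guessing between folders would go beyond what the steps above describe.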

{.tip }
> Some PID providers such as DataCite may respond to content-negotiation and provide their own JSON-LD, which does not describe an RO-Crate (the `profile=` was ignored). The use of Signposting allows the repository to explicitly provide the RO-Crate.
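
Selecting a Signposting link from an already retrieved `Link` header value could be sketched as below. This is a simplified illustration, not a complete HTTP `Link` header parser: the regular expressions and the names `parse_link_header` and `pick_crate_link` are assumptions of this sketch, and ranking `describedby` ahead of `item` when both declare the profile is a choice made here rather than something the text requires.

```python
import re

RO_CRATE_PROFILE = "https://w3id.org/ro/crate"

# <target> followed by any number of ;name=value parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Split a Link header value into (target, attributes) pairs."""
    links = []
    for target, params in _LINK.findall(value):
        attrs = {}
        for name, quoted, bare in _PARAM.findall(params):
            attrs[name.lower()] = quoted or bare
        links.append((target, attrs))
    return links


def pick_crate_link(value):
    """Return (rel, target) for the best describedby/item link, or None."""
    candidates = []
    for target, attrs in parse_link_header(value):
        rels = attrs.get("rel", "").split()
        has_profile = RO_CRATE_PROFILE in attrs.get("profile", "").split()
        for rel in ("describedby", "item"):
            if rel in rels:
                # Sort key: links with the profile first, then describedby.
                candidates.append((not has_profile, rel != "describedby", rel, target))
    if not candidates:
        return None
    _, _, rel, target = min(candidates)
    return rel, target
```

For the `Link` header in the `curl` example above, `pick_crate_link` returns the `item` relation together with the archive URL `https://workflowhub.eu/workflows/120/ro_crate?version=5`.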

{% include references.liquid %}
15 changes: 2 additions & 13 deletions docs/_specification/1.2-DRAFT/profiles.md
@@ -161,21 +161,10 @@ The rest of the [earlier requirements](#declaring-conformance-of-an-ro-crate-pro

### How to retrieve a Profile Crate

To resolve a Profile URI to a machine-readable _Profile Crate_, two approaches are recommended to retrieve its [RO-Crate Metadata Document](root-data-entity#ro-crate-metadata-descriptor):
To resolve a Profile URI to a machine-readable _Profile Crate_, follow the approaches of [retrieving an RO-Crate](data-entities#retrieving-an-ro-crate).

1. [HTTP Content-negotiation] for the [RO-Crate media type](appendix/jsonld#ro-crate-json-ld-media-type), for example:
If none of these approaches worked, then this profile probably does not have a corresponding Profile Crate. For human display of conformed profiles, display a hyperlink to its `@id` Web page, described by its `name`.

Requesting `https://w3id.org/workflowhub/workflow-ro-crate/1.0` with HTTP header

`Accept: application/ld+json;profile=https://w3id.org/ro/crate` redirects to the _RO-Crate Metadata file_
`https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json`

2. The above approach may fail (or returns a HTML page), e.g. for content-delivery networks that do not support content-negotiation. The fallback is to try resolving the path `./ro-crate-metadata.json` from the _resolved_ URI (after permalink redirects). For example:
If permalink `https://w3id.org/workflowhub/workflow-ro-crate/1.0` redirects to `https://about.workflowhub.eu/Workflow-RO-Crate/1.0/index.html` (a HTML page), then
try retrieving `https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json`
3. If none of these approaches worked, then this profile probably does not have a corresponding Profile Crate. For humans, display a hyperlink to its `@id` described by its `name`.

<!-- TODO Make both examples above actually work! -->


#### Shared contextual entities from a Profile Crate
2 changes: 1 addition & 1 deletion docs/_specification/1.2-DRAFT/root-data-entity.md
@@ -207,7 +207,7 @@ RO-Crates that have been assigned a _persistent identifier_ (e.g. a DOI) SHOULD
#### Resolvable persistent identifiers and citation text

It is RECOMMENDED that resolving the `identifier` programmatically return the _RO-Crate Metadata Document_ or an archive (e.g. ZIP) that contains the _RO-Crate Metadata File_, using [content negotiation](profiles#how-to-retrieve-a-profile-crate) and/or [Signposting]. With an RO-Crate identifier that is persistent and resolvable in this way from a URI, the root data entity SHOULD indicate this using the `cite-as` property according to [RFC8574]. Likewise, an HTTP/HTTPS server of the resolved RO-Crate Metadata Document or archive (possibly after redirection) SHOULD indicate that persistent identifier in its [Signposting] headers using `Link rel="cite-as"`.
It is RECOMMENDED that resolving the `identifier` programmatically return the _RO-Crate Metadata Document_ or an archive (e.g. ZIP) that contains the _RO-Crate Metadata File_, using [content negotiation](data-entities#retrieving-an-ro-crate) and/or [Signposting]. With an RO-Crate identifier that is persistent and resolvable in this way from a URI, the root data entity SHOULD indicate this using the `cite-as` property according to [RFC8574]. Likewise, an HTTP/HTTPS server of the resolved RO-Crate Metadata Document or archive (possibly after redirection) SHOULD indicate that persistent identifier in its [Signposting] headers using `Link rel="cite-as"`.

{: .tip}
> The above `cite-as` MAY go to a repository landing page, and MAY require authentication, but MUST ultimately have the RO-Crate as a downloadable item, which SHOULD be programmatically accessible through content negotiation or [Signposting] (`Link rel="describedby"` for an _RO-Crate Metadata Document_, or `Link rel="item"` for an archive). To instead associate a textual scholarly citation with a crate (e.g. journal article), indicate a [publication via `citation` property](contextual-entities#publications-via-citation-property).