From 0351f51167e134fbdfb5dce295d6bbf2d0c33fb9 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 13 Nov 2024 10:29:17 -0500 Subject: [PATCH 01/12] fix github links --- docs/decision_record.md | 65 +++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 31 deletions(-) diff --git a/docs/decision_record.md b/docs/decision_record.md index 7e57d31..fe1fdcb 100644 --- a/docs/decision_record.md +++ b/docs/decision_record.md @@ -8,6 +8,9 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "S [TOC] +## 2024-11-13 Attributes can be designed as `passthru` or `transient`. + + ## 2024-10-02 The `/collection` and `/attribute` endpoints will both be `REQUIRED` @@ -96,7 +99,7 @@ In the future if the number of proposed ancillary attributes grows, it could mov ### Linked issues -- +- ## 2024-02-21 We will specify core sequence collection attributes and a process for adding new ones @@ -120,9 +123,9 @@ Choosing to host this list as a list of issues allows the list to always be up t ### Linked issues - - - - - - + - + - + - ## 2024-01-10 Clarifications on the purpose and form of the JSON schema in service-info @@ -148,8 +151,8 @@ Another issue is that we wanted the schema to be a place where a user could see ### Linked issues -- -- +- +- ## 2024-01-06 The comparison function use more descriptive attribute names @@ -171,7 +174,7 @@ The comparison function is designed to compare two sequence collections by inter ### Linked issues -- +- ## 2023-08-25 The user-facing API will neither expect nor provide prefixes @@ -236,7 +239,7 @@ properties: ### Linked issues -- https://github.com/ga4gh/seqcol-spec/issues/40 +- https://github.com/ga4gh/refget/issues/40 ## 2023-07-26 There will be no metadata endpoint @@ -256,9 +259,9 @@ We distinguished between two types of metadata: ### Linked issues -- -- -- +- +- +- ## 2023-07-12 - Required attributes are: lengths and names @@ -302,7 +305,7 @@ This leads us to the conclusion that *sequences* should be 
optional, and *names* ### Linked issues -- +- ## 2023-06-14 - Internal digests SHOULD NOT be prefixed @@ -335,7 +338,7 @@ Adding prefixes will complicate things and does not add benefits. Prefixes may b ### Linked issues -- +- ## 2023-06-28 - SeqCol JSON schema defines reserved attributes without additional namespacing @@ -400,7 +403,7 @@ Thus, we introduce the idea of *inherent* vs *non-inherent attributes*. Inherent ### Linked issues -- +- ### Alternatives considered @@ -420,7 +423,7 @@ While non-ASCII array names would be compatible with our current specification, ### Linked issues -- +- ## 2023-01-25 - The digest algorithm will be the GA4GH digest @@ -449,7 +452,7 @@ Under this scheme the string `ACGT` will result in the `sha512t24u` digest `aKF4 ### Linked issues -- [https://github.com/ga4gh/seqcol-spec/issues/30](https://github.com/ga4gh/seqcol-spec/issues/30) +- [https://github.com/ga4gh/refget/issues/30](https://github.com/ga4gh/refget/issues/30) ## 2023-01-12 - How sequence collection are serialized prior to digestion @@ -536,9 +539,9 @@ It also future-proofs the serialisation method if we ever allow complex object t ### Linked issues - - [https://github.com/ga4gh/seqcol-spec/issues/1](https://github.com/ga4gh/seqcol-spec/issues/1) - - [https://github.com/ga4gh/seqcol-spec/issues/25](https://github.com/ga4gh/seqcol-spec/issues/25) - - [https://github.com/ga4gh/seqcol-spec/issues/33](https://github.com/ga4gh/seqcol-spec/issues/33) + - [https://github.com/ga4gh/refget/issues/1](https://github.com/ga4gh/refget/issues/1) + - [https://github.com/ga4gh/refget/issues/25](https://github.com/ga4gh/refget/issues/25) + - [https://github.com/ga4gh/refget/issues/33](https://github.com/ga4gh/refget/issues/33) ### Known limitations @@ -636,7 +639,7 @@ We should be consistent by using these terms to refer to the above representatio ### Linked issues -- +- ## 2022-06-15 - Structure for the return value of the comparison API endpoint @@ -704,8 +707,8 @@ The primary purpose 
of the compare function is to provide a high-level view of h ### Linked issues -- -- +- +- ### Alternatives considered @@ -778,8 +781,8 @@ We need a formal definition of a sequence collection. The schema provides a mach ### Linked issues -- -- +- +- ## 2021-12-01 - Endpoint names and structure @@ -825,8 +828,8 @@ For the `POST comparison` endpoint, we made 2 limitations to simplify the implem ### Linked issues -- [https://github.com/ga4gh/seqcol-spec/issues/21](https://github.com/ga4gh/seqcol-spec/issues/21) -- [https://github.com/ga4gh/seqcol-spec/issues/23](https://github.com/ga4gh/seqcol-spec/issues/23) +- [https://github.com/ga4gh/refget/issues/21](https://github.com/ga4gh/refget/issues/21) +- [https://github.com/ga4gh/refget/issues/23](https://github.com/ga4gh/refget/issues/23) ## 2021-09-21 - Order will be recognized by digesting arrays in the given order, and unordered digests will be handled as extensions through additional attributes @@ -854,7 +857,7 @@ To conclude, option A seems simple and straightforward, satisfies for a basic im ### Linked issues -- https://github.com/ga4gh/seqcol-spec/issues/5 +- https://github.com/ga4gh/refget/issues/5 ### Known limitations @@ -877,7 +880,7 @@ However, there are also scenarios for which the order of sequences in a collecti ### Linked issues -- [https://github.com/ga4gh/seqcol-spec/issues/5](https://github.com/ga4gh/seqcol-spec/issues/5) +- [https://github.com/ga4gh/refget/issues/5](https://github.com/ga4gh/refget/issues/5) ### Known limitations @@ -917,8 +920,8 @@ This will allow retrieving individual attributes, and testing for identity of in ### Linked issues -- [https://github.com/ga4gh/seqcol-spec/issues/8#issuecomment-773489450](https://github.com/ga4gh/seqcol-spec/issues/8#issuecomment-773489450) -- [https://github.com/ga4gh/seqcol-spec/issues/10](https://github.com/ga4gh/seqcol-spec/issues/10) +- 
[https://github.com/ga4gh/refget/issues/8#issuecomment-773489450](https://github.com/ga4gh/refget/issues/8#issuecomment-773489450) +- [https://github.com/ga4gh/refget/issues/10](https://github.com/ga4gh/refget/issues/10) ### Known limitations @@ -937,7 +940,7 @@ Should a wider GA4GH standard appear from [TASC issue 5](https://github.com/ga4g ### Linked issues -- [https://github.com/ga4gh/seqcol-spec/issues/2](https://github.com/ga4gh/seqcol-spec/issues/2) +- [https://github.com/ga4gh/refget/issues/2](https://github.com/ga4gh/refget/issues/2) ### Known limitations From b1dc384a56ec7fc01c71d4abed03f155619e4bd2 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 13 Nov 2024 12:13:29 -0500 Subject: [PATCH 02/12] write up recent decisions and changes --- docs/decision_record.md | 36 +++++++ docs/seqcol.md | 230 +++++++++++++++++++++++++++------------- 2 files changed, 195 insertions(+), 71 deletions(-) diff --git a/docs/decision_record.md b/docs/decision_record.md index fe1fdcb..1cbb184 100644 --- a/docs/decision_record.md +++ b/docs/decision_record.md @@ -10,7 +10,43 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "S ## 2024-11-13 Attributes can be designed as `passthru` or `transient`. +### Decision + +We add two new attribute qualifiers: transient and passthru. + +- Passthru attributes are not digested in transition from level 2 to level 1. Most attributes of the canonical (level 2) seqcol representation are digested to create the level 1 representation. But sometimes, we have an attribute for which digesting makes little sense. These attributes are passed through the transformation, so they show up on the level 1 representation in the same form as the level 2 representation. Thus, we refer to them as passthru attributes. +Transient attributes + +- Transient attributes are not retrievable from the attribute endpoint. Most attributes of the sequence collection can be retrieved through the /attribute endpoint. 
However, some attributes may not be retrievable. For example, this could happen for an attribute that we intend to be used primarily as an identifier. In this case, we don't necessarily want to store the original content that went into the digest into the database, because it might be redundant. We really just want the final attribute. These attributes are called transient because the content of the attribute is no longer stored and is therefore no longer retrievable. + +### Rationale + +As we worked on more advanced attributes, and with the addition of the `/attribute` endpoint, we realized these changes necessitate a bit more power for the schema to specify behavior of the attributes. For the basic seqcol attributes (names, lengths, sequences) and original endpoint, the general algorithm and basic qualifiers (required, inherent, collated) suffice to describe the representation. But some more nuanced attributes require additional qualifiers to describe their intention and how the server should behave for the `/attribute` endpoint. For example, sorted_name_length_pairs and sorted_sequences are intended to provide alternative tailored identifiers and comparisons, and are not necessarily useful for independent attribute lookup. Similarly, custom extra attributes, like author or alias, may be simple appendages that don't need the complex digesting procedure we use for the basic attributes. In order to flag such attributes in a way that can govern slightly different server expectations, we need a couple of additional advanced attribute qualifiers. For this purpose, we added the passthru and transient qualifiers. + +### Linked issues + +- + + +## 2024-10-02 Minimal schema should now require sequences, and lengths should not be inherent. + +### Decision + +We will update the minimal schema with these changes: 1. Move sequences into 'required', and 2. remove lengths from 'inherent'.
So the final qualifiers would be: +- required: names, lengths, and sequences +- inherent: names, sequences + + +### Rationale + +Originally, there was a good rationale for making sequences not required, to allow for coordinate systems to be represented as a seqcol. +But with the new `/attribute` endpoint, there's a better way to handle it, using `name_length_pairs` and `sorted_name_length_pairs` attributes. +Then, with sequences required, it does not make sense for lengths to be inherent because they are computable from sequences. +So essentially, the attribute endpoint allows us to move away from handling coordinate systems as top-level entities, and instead moves toward using the attribute endpoint for coordinate systems. + +### Linked issues +- ## 2024-10-02 The `/collection` and `/attribute` endpoints will both be `REQUIRED` diff --git a/docs/seqcol.md b/docs/seqcol.md index 5daa2b9..64d6eeb 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -30,7 +30,7 @@ A common example and primary use case of sequence collections is for reference g In brief, the project specifies several procedures: 1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol digests are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. -2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies a RESTful API to retrieve the sequence collection given a digest. 
A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different references genomes are compatible. +2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies an http API to retrieve the sequence collection given a digest. A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different references genomes are compatible. 3. **Recommended ancillary, non-inherent attributes.** Finally, the protocol defines several recommended procedures that will improve the compatibility across Seqcol servers, and beyond. ## Use cases @@ -64,13 +64,14 @@ Some other examples of common use cases where the use of seqcol is beneficial in ## Definitions of key terms +### General terms + - **Alias**: A human-readable identifier used to refer to a sequence collection. - **Array**: An ordered list of elements. -- **Collated**: A qualifier applied to a seqcol attribute indicating that the values of the attribute matches 1-to-1 with the sequences in the collection and are represented in the same order. - **Coordinate system**: An ordered list of named sequence lengths, but without actual sequences. - **Digest**: A string resulting from a cryptographic hash function, such as `MD5` or `SHA512`, on input data. 
-- **Inherent**: A qualifier applied to a seqcol attribute indicating that the attribute is part of the definition of the sequence collection and therefore contributes to its digest. - **Length**: The number of characters in a sequence. +- **Qualifier**: A reserved term used in the schema to indicate a quality of an attribute, such as whether it is required, collated, or inherent. Qualifiers are listed below. - **Seqcol algorithm**: The set of instructions used to compute a digest from a sequence collection. - **Seqcol API**: The set of endpoints defined in the *retrieval* and *comparison* components of the seqcol protocol. - **Seqcol digest**: A digest for a sequence collection, computed according to the seqcol algorithm. @@ -80,6 +81,13 @@ Some other examples of common use cases where the use of seqcol is beneficial in - **Sequence collection**: A representation of 1 or more sequences that is structured according to the sequence collection schema - **Sequence collection attribute**: A property or feature of a sequence collection (*e.g.* names, lengths, sequences, or topologies). +### Attribute qualifiers + +- **Collated**: A qualifier applied to a seqcol attribute indicating that the values of the attribute match 1-to-1 with the sequences in the collection and are represented in the same order. +- **Inherent**: A qualifier applied to a seqcol attribute indicating that the attribute is part of the definition of the sequence collection and therefore contributes to its digest. +- **Passthru**: A qualifier applied to a seqcol attribute indicating that the attribute is *not digested* in transition from level 2 to level 1. Its value in the level 1 representation is therefore the same as in the level 2 representation. +- **Transient**: A qualifier applied to a seqcol attribute indicating that the attribute *cannot be retrieved through the `/attribute` endpoint*.
+ ## Seqcol protocol functionality The seqcol algorithm is based on the refget sequence algorithm for individual sequences and should use refget sequence servers to store the actual sequence data. @@ -90,13 +98,14 @@ The seqcol protocol defines the following: 1. *Schema* - The way an implementation should define the attributes of sequence collections it holds. 2. *Encoding* - An algorithm for computing a digest given a sequence collection. -3. *API* - A server RESTful API specification for retrieving and comparing sequence collections. +3. *API* - A server API specification for retrieving and comparing sequence collections. 4. *Ancillary attribute management* - A specification for organizing non-inherent metadata as part of a sequence collection. ### 1. Schema: Defining the attributes in the collection The first step for a Sequence Collections implementation is to define the *list of contents*, that is, what attributes are allowed in each collection, and which of these affect the digest. -The sequence collections standard is flexible with respect to the schema used, so implementations of the standard can use the standard with different schemas, as required by a particular use case. This divides the choice of content from the choice of algorithm, allowing the algorithm to be consistent even in situations where the content is not. +The sequence collections standard is flexible with respect to the schema used, so implementations of the standard can use the standard with different schemas, as required by a particular use case. +This divides the choice of content from the choice of algorithm, allowing the algorithm to be consistent even in situations where the content is not. This is an example of a general, minimal schema: @@ -131,16 +140,18 @@ properties: required: - names - lengths -inherent: - - lengths - - names - sequences +ga4gh: + inherent: + - names + - sequences ``` -This example schema is the minimal standard schema. 
We RECOMMEND that all implementations use this as a base schema, adding additional attributes as needed, but *without changing the inherent attributes list*, because this will keep the identifiers compatible across implementations. +This example schema is the minimal standard schema. Sequence collection objects that follow this basic minimal structure are said to be the *canonical seqcol object representation*. -Adding custom attributes to this schema will not break interoperability. -Nevertheless, extending this schema is only RECOMMENDED; implementations are still compliant if using custom schemas with custom inherent attributes. +We RECOMMEND that all implementations use this as a base schema, adding additional attributes as needed, which will not break interoperability. +We RECOMMEND *not changing the inherent attributes list*, because this will keep the identifiers compatible across implementations. +Implementations that use different inherent attributes are still compliant with the specification generally, but do so at the cost of top-level digest interoperability. For more information about community-driven updates to the standard schema, see [*Footnote F8*](#f8-adding-new-schema-attributes). @@ -194,10 +205,8 @@ The implementation `MUST` define its structure in a JSON Schema, such as the exa Implementations `MAY` choose to extend this schema by adding additional attributes. Implementations `MAY` also use a schema, but we `RECOMMEND` the schema extend the base schema defined above. This schema extends vanilla JSON Schema in two ways; first, it provides the `collated` qualifier. -For further details about the rationale behind collated attributes, see [*Footnote F2*](#f2-collated-attributes). Second, it specifies the `inherent` qualifier. -For further details about the rationale and examples of non-inherent attributes, see [*Footnote F3*](#f3-details-of-inherent-and-non-inherent-attributes). 
-Finally, another detail that may be unintuitive at first is that in the minimal schema, the `sequences` attribute is optional; for an explanation of why, see [*Footnote F4*](#f4-sequence-collections-without-sequences). +For further details about attribute qualifiers, see [*Section 4*](#4-extending-the-schema-schema-attribute-qualifiers). ##### Filter non-inherent attributes @@ -230,7 +239,7 @@ b'["SQ.2648ae1bacce4ec4b6cf337dcae37816","SQ.907112d17fcb73bcab1ed1c72b97ce68"," _* The above Python function suffices if (1) attribute keys are restricted to ASCII, (2) there are no floating point values, and (3) for all integer values `i`: `-2**63 < i < 2**63`_ - Also, notice that in this process, RFC-8785 is applied only to objects; we assume the sequence digests are computed through an external process (the refget sequences protocol), and are not computed as part of the sequence collection. The refget sequences protocol digests sequence strings without JSON-canonicalization. For more details, see [*Footnote F5*](#f5-rfc-8785-does-not-apply-to-refget-sequences). +Also, notice that in this process, RFC-8785 is applied only to objects; we assume the sequence digests are computed through an external process (the refget sequences protocol), and are not computed as part of the sequence collection. The refget sequences protocol digests sequence strings without JSON-canonicalization. For more details, see [*Footnote F5*](#f5-rfc-8785-does-not-apply-to-refget-sequences). #### Step 3: Digest each canonicalized attribute value using the GA4GH digest algorithm. @@ -266,6 +275,11 @@ wqet7IWbw2j2lmGuoKCaFlYS_R7szczz ``` +#### Exception for passthru attributes + +The above canonicalization/digesting procedure is applied by default to all attributes of a sequence collection; however, there can be some exceptions. +Any attribute qualified in the schema as a *passthru* attribute is NOT digested in this way. 
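As a rough illustration of the passthru exception above, the following Python sketch converts a level 2 representation to level 1 while leaving passthru attributes untouched. The function names (`canonical_str`, `sha512t24u_digest`, `level2_to_level1`) and the idea of passing the passthru attribute names in as a set are assumptions made for this example, not part of the specification. The canonicalization shortcut is only adequate under the caveats noted earlier (ASCII keys, no floating point values, bounded integers).

```python
import base64
import hashlib
import json


def canonical_str(item) -> bytes:
    """Serialize a value to compact JSON with sorted keys.

    This stands in for RFC-8785 and is only adequate under the caveats
    stated earlier (ASCII keys, no floats, bounded integers).
    """
    return json.dumps(
        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
    ).encode("utf-8")


def sha512t24u_digest(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def level2_to_level1(level2: dict, passthru: set) -> dict:
    """Digest every attribute except those named in `passthru` (hypothetical parameter)."""
    level1 = {}
    for name, value in level2.items():
        if name in passthru:
            # Passthru attributes keep their level 2 value unchanged.
            level1[name] = value
        else:
            level1[name] = sha512t24u_digest(canonical_str(value))
    return level1


# Hypothetical collection with a custom "author" attribute treated as passthru.
example = {"names": ["chr1", "chr2"], "lengths": [10, 20], "author": "someone"}
print(level2_to_level1(example, passthru={"author"}))
```

In this sketch `author` appears verbatim at level 1, while `names` and `lengths` are replaced by their digests.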
+ #### Terminology Because the encoding algorithm is recursive, this leads to a few different ways to represent a sequence collection. We refer to these representations in "levels". The level number represents the number of "lookups" you'd have to do from the "top level" digest. So, we have: @@ -313,7 +327,7 @@ What you'd get with **2 database lookups** (1 recursive call). This is the most ``` -### 3. API: A server RESTful API specification for retrieving and comparing sequence collections. +### 3. API: A server API specification for retrieving and comparing sequence collections. The API has these top-level endpoints: @@ -362,10 +376,11 @@ properties: required: - names - lengths -inherent: - - lengths - - names - sequences +ga4gh: + inherent: + - sequences + - names ``` @@ -490,6 +505,12 @@ Example `/attribute/collection/names/:digest` return value: ["A","B","C"] ``` +The attribute endpoint MUST be functional for any attribute defined in the schema, *except those marked as transient or passthru*. +The endpoint MAY respond to requests for attributes marked as *passthru*. +The endpoint SHOULD NOT respond to requests for attributes marked as *transient*. +For more information on transient and passthru attributes, see [Section 4](#4-extending-the-schema-schema-attribute-qualifiers). + + ##### Definition of `object_type` The `/list` and `/attribute` endpoints both use an `:object_type` path parameter. The `object_type` should always be the *singular* descriptor of objects provided by the server. In this version of the Sequence Collections specification, the `object_type` is always `collection`; so the only allowable endpoints would be `/list/collection` and `/attribute/collection/:attribute_name/:digest`. We call this `object_type` because future versions of the specification may allow retrieving lists or attributes of other types of objects. 
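The qualifier rules for the `/attribute` endpoint given above could be summarized in a small helper. This is only a sketch: the function name `attribute_endpoint_requirement`, the `transient` and `passthru` list names under the `ga4gh` key, and the choice to let `transient` take precedence when an attribute carries both qualifiers are all assumptions for illustration.

```python
def attribute_endpoint_requirement(attribute: str, schema: dict) -> str:
    """Return the requirement level for serving `attribute` via /attribute.

    Assumes object-level qualifier lists live under schema["ga4gh"],
    parallel to the `inherent` list (an assumption for this sketch).
    """
    if attribute not in schema.get("properties", {}):
        return "UNDEFINED"  # not declared in the schema at all
    qualifiers = schema.get("ga4gh", {})
    if attribute in qualifiers.get("transient", []):
        return "SHOULD NOT"  # content is not stored, so not retrievable
    if attribute in qualifiers.get("passthru", []):
        return "MAY"
    return "MUST"


# Hypothetical schema fragment used only to exercise the helper.
schema = {
    "properties": {"names": {}, "author": {}, "sorted_name_length_pairs": {}},
    "ga4gh": {
        "inherent": ["names"],
        "passthru": ["author"],
        "transient": ["sorted_name_length_pairs"],
    },
}
for attr in ("names", "author", "sorted_name_length_pairs"):
    print(attr, attribute_endpoint_requirement(attr, schema))
```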
@@ -499,8 +520,98 @@ The `/list` and `/attribute` endpoints both use an `:object_type` path parameter In addition to the primary top-level endpoints, it is RECOMMENDED that the service provide `/openapi.json`, an OpenAPI-compatible description of the endpoints. ---- -### 4. Ancillary attribute management: recommended non-inherent attributes + + + +### 4. Extending the schema: Schema attribute qualifiers + +#### 4.1 Introduction to attribute qualifiers + +The Sequence Collections specification is designed to be extensible. +This will let you build additional capability on top of the minimal Sequence Collections standard. +You can do this by extending the schema to include ancillary custom attributes. +To allow other services to understand something about what these attributes are for, you can annotate them in the schema using *attribute qualifiers*. +This allows you to indicate what *type* of attribute your custom attributes are, which governs how the service should respond to requests. +This section will describe the 4 attribute qualifiers you may add to the schema to qualify custom attributes. + +#### 4.2 Collated attributes + +Collated attributes are attributes whose values match 1-to-1 with the sequences in the collection and are represented in the same order. +A collated attribute by definition has the same number of elements as the number of sequences in the collection. +It is also in the same order as the sequences in the collection. + + +#### 4.3 Inherent attributes + +Inherent attributes are those that contribute to the digest. +The specification in section 2, *Encoding*, described how to structure a sequence collection and then apply an algorithm to compute a digest for it. +What if you have ancillary information that goes with a collection, but shouldn't contribute to the digest? +We have found many use cases for information that should go along with a seqcol, but should not contribute to the *identity* of that seqcol.
+This is a useful construct as it allows us to include information in a collection that does not affect the digest that is computed for that collection. +One simple example is the "author" or "uploader" of a reference sequence; this is useful information to store alongside this collection, but we wouldn't want the same collection with two different authors to have a different digest! Seqcol refers to these as *non-inherent attributes*, meaning they are not part of the core identity of the sequence collection. +Non-inherent attributes are defined in the seqcol schema, but excluded from the `inherent` list. + +See: [ADR on 2023-03-22 regarding inherent attributes](decision_record.md#2023-03-22-seqcol-schemas-must-specify-inherent-attributes) + +#### 4.4 Passthru attributes + +Passthru attributes are *not digested* in transition from level 2 to level 1. +In other words, the value of a passthru attribute is the same in the level 1 and level 2 representations. +This is not the case for most attributes; most attributes of the canonical (level 2) seqcol representation are digested to create the level 1 representation. +But sometimes, we have an attribute for which digesting makes little sense. +These attributes are passed through the transformation, so they show up on the level 1 representation in the same form as the level 2 representation. +Thus, we refer to them as passthru attributes. + + +#### 4.5 Transient attributes + +Transient attributes are those that *cannot be retrieved* through the `/attribute` endpoint. +Most attributes of the sequence collection can be retrieved through the `/attribute` endpoint. +However, some attributes may not be retrievable. +For example, this could happen for an attribute that we intend to be used primarily as an identifier. +In this case, we don't necessarily want to store the original content that went into the digest into the database, because it might be redundant. +We really just want the final attribute.
+These attributes are called transient because the content of the attribute is no longer stored and is therefore no longer retrievable. + +#### 4.6 Method of specifying attribute qualifiers + +In JSON Schema, there are 2 ways to qualify properties: 1) a local qualifier, using a key under a property; or 2) an object-level qualifier, which is specified with a keyed list of properties up one level. +For example, you annotate a property's `type` with a local qualifier, underneath the property, like this: + +```console +properties: + names: + type: array +``` + +However, you specify that a property is `required` by adding it to an object-level `required` list that's parallel to the `properties` keyword: + +```console +properties: + names: + type: array +required: + - names +``` + +In sequence collections, we define `collated` as a local qualifier. +Local qualifiers fit better for qualifiers independent of the object as a whole. +They are qualities of a property that persist if the property were moved onto a different object. +For example, the `type` of an attribute is consistent, regardless of what object that attribute were defined on. +In contrast, object-level qualifier lists fit better for qualifiers that depend on the object as a whole. +They are qualities of a property that depend on the object context in which the property is defined. +For example, the `required` modifier is not really meaningful except in the context of the object as a whole. +A particular property could be required for one object type, but not for another, and it's really the object that induces the requirement, not the property itself. + + +We reasoned that `inherent`, `transient`, and `passthru` are global qualifiers, like `required`, which describe the role of an attribute in the context of the whole object. +For example, an attribute that is inherent to one type of object need not be inherent to another. 
+Therefore, it makes sense to treat this concept the same way JSON Schema treats `required`. +In contrast, the idea of `collated` describes a property independently: Whether an attribute is collated is part of the definition of the attribute; if the attribute were moved to a different object, it would still be collated. + +Finally, the 3 global qualifiers are grouped under the 'ga4gh' key for consistency with other GA4GH specifications, and to group the seqcol-specific extended functionality in one place. + +### 5. Ancillary attribute management: recommended non-inherent attributes In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. Non-inherent attributes provide a standardized way for implementations to store and serve additional, third-party attributes that do not contribute to the digest. @@ -508,12 +619,28 @@ As long as separate implementations keep such information in non-inherent attrib Furthermore, the structure for how such non-inherent metadata is retrieved will be standardized. Here, we specify standardized, useful non-inherent attributes that we recommend. -#### 4.1 The `sorted_name_length_pairs` attribute (`RECOMMENDED`) +#### 5.1 The `name_length_pairs` attribute (`RECOMMENDED`) + +The `name_length_pairs` attribute is a *non-inherent* attribute of a sequence collection with a formal definition, provided here. +This attribute provides a way to look up the ordered coordinate system (the "chrom sizes") for a sequence collection. +It is created deterministically from the `names` and `lengths` attributes in the collection; it *does not* depend on the actual sequence content, so it is consistent across two collections with different sequence content if they have the same `names` and `lengths`, which are correctly collated. +This attribute is `RECOMMENDED` to allow retrieval of the coordinate system for a given reference sequence collection.
+ -The `sorted_name_length_pairs` attribute is a *non-inherent* attribute of a sequence collection with a formal definition, provided here. +Algorithm: + +1. Combine each name-length pair from the primary collated `names` and `lengths` into an object, like `{"length":123,"name":"chr1"}`. +2. Build a collated list, corresponding to the names and lengths of the object (*e.g.* `[{"length":123,"name":"chr1"},{"length":456,"name":"chr2"},...]`). +3. Add as a collated attribute to the sequence collection object. + +The `name_length_pairs` attribute is *not inherent*, *not passthru*, and *not transient*. + +#### 5.2 The `sorted_name_length_pairs` attribute (`RECOMMENDED`) + +The `sorted_name_length_pairs` attribute is similar to the `name_length_pairs` attribute, but it is sorted. When digested, this attribute provides a digest for an order-invariant coordinate system for a sequence collection. Because it is *non-inherent*, it does not affect the identity (digest) of the collection. -It is created deterministically from the `names` and `lengths` attributes in the collection; it *does not* depend on the actual sequence content, so it is consistent across two collections with different sequence content if they have the same `names` and `lengths`, which are correctly collated, but with pairs not necessarily in the same order. +It is consistent across collections that have the same `names` and `lengths`, correctly collated, but with pairs not necessarily in the same order. + This attribute is `RECOMMENDED` to allow unified genome browser visualization of data defined on different reference sequence collections. For more rationale and use cases of `sorted_name_length_pairs`, see [*Footnote F7*](#f7-use-cases-for-the-sorted_name_length_pairs-non-inherent-attribute). Algorithm: @@ -522,9 +649,11 @@ Algorithm: 2. Canonicalize JSON according to the seqcol spec (using RFC-8785). 3. Digest each name-length pair string individually. 4. Sort the digests lexicographically. -5.
Add to the sequence collection object. -#### 4.2 The `sorted_sequences` attribute (`OPTIONAL`) +The `sorted_name_length_pairs` attribute is: non-inherent, non-collated, non-passthru, and transient. + +#### 5.3 The `sorted_sequences` attribute (`OPTIONAL`) The `sorted_sequences` attribute is a *non-inherent* attribute of a sequence collection, with a formal definition. Providing this attribute is `OPTIONAL`. @@ -541,6 +670,10 @@ Algorithm: 2. Canonicalize the resulting array (using RFC-8785). 3. Add to the sequence collection object as the `sorted_sequences` attribute, which is non-inherent and non-collated. + + +--- + ## Footnotes ### F1. Why use an array-oriented structure instead of a sequence-oriented structure? @@ -557,51 +690,6 @@ While the latter is intuitive, as it captures each sequence object with some acc See [ADR on 2021-06-30 on array-oriented structure](decision_record.md#2021-06-30-use-array-based-data-structure-and-multi-tiered-digests) - -### F2. Collated attributes - -In JSON Schema, there are 2 ways to qualify properties: 1) a local qualifier, using a key under a property; or 2) an object-level qualifier, which is specified with a keyed list of properties up one level. -For example, you annotate a property's `type` with a local qualifier, underneath the property, like this: - -```console -properties: - names: - type: array -``` - -However, you specify that a property is `required` by adding it to an object-level `required` list that's parallel to the `properties` keyword: - -```console -properties: - names: - type: array -required: - - names -``` - -In sequence collections, we chose to define `collated` as a local qualifier. Local qualifiers fit better for qualifiers independent of the object as a whole. -They are qualities of a property that persist if the property were moved onto a different object. -For example, the `type` of an attribute is consistent, regardless of what object that attribute were defined on. 
-In contrast, object-level qualifier lists fit better for qualifiers that depend on the object as a whole. -They are qualities of a property that depend on the object context in which the property is defined. -For example, the `required` modifier is not really meaningful except in the context of the object as a whole. A particular property could be required for one object type, but not for another, and it's really the object that induces the requirement, not the property itself. - -We reasoned that `inherent`, like `required`, describes the role of an attribute in the context of the whole object; an attribute that is inherent to one type of object need not be inherent to another. -Therefore, it makes sense to treat this concept the same way JSON schema treats `required`. -In contrast, the idea of `collated` describes a property independently: Whether an attribute is collated is part of the definition of the attribute; if the attribute were moved to a different object, it would still be collated. - - -### F3. Details of inherent and non-inherent attributes - -The specification in section 1, *Encoding*, described how to structure a sequence collection and then apply an algorithm to compute a digest for it. -What if you have ancillary information that goes with a collection, but shouldn't contribute to the digest? -We have found a lot of useful use cases for information that should go along with a seqcol, but should not contribute to the *identity* of that seqcol. -This is a useful construct as it allows us to include information in a collection that does not affect the digest that is computed for that collection. -One simple example is the "author" or "uploader" of a reference sequence; this is useful information to store alongside this collection, but we wouldn't want the same collection with two different authors to have a different digest! 
Seqcol refers to these as *non-inherent attributes*, meaning they are not part of the core identity of the sequence collection. -Non-inherent attributes are defined in the seqcol schema, but excluded from the `inherent` list. - -See: [ADR on 2023-03-22 regarding inherent attributes](decision_record.md#2023-03-22-seqcol-schemas-must-specify-inherent-attributes) - ### F4. Sequence collections without sequences Typically, we think of a sequence collection as consisting of real sequences, but in fact, sequence collections can also be used to specify collections where the actual sequence content is irrelevant. From 54049bbf19c0b4ab8543c15572597ef39c592da1 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 20 Nov 2024 09:09:29 -0500 Subject: [PATCH 03/12] add details to new attribute qualifiers --- docs/seqcol.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/seqcol.md b/docs/seqcol.md index 64d6eeb..0048f08 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -203,7 +203,7 @@ The object is a series of arrays with matching length (`3`), with the correspond For the rationale why this structure was chosen instead of an array of annotated sequences, see [*Footnote F1*](#f1-why-use-an-array-oriented-structure-instead-of-a-sequence-oriented-structure). The implementation `MUST` define its structure in a JSON Schema, such as the example schema defined in step 1. Implementations `MAY` choose to extend this schema by adding additional attributes. -Implementations `MAY` also use a schema, but we `RECOMMEND` the schema extend the base schema defined above. +Implementations `MAY` also use a different schema, but we `RECOMMEND` the schema extend the base schema defined above. This schema extends vanilla JSON Schema in two ways; first, it provides the `collated` qualifier. Second, it specifies the `inherent` qualifier. For further details about attribute qualifiers, see [*Section 4*](#4-extending-the-schema-schema-attribute-qualifiers). 
@@ -485,6 +485,8 @@ Example return value: } ``` +The `list` endpoint MUST be implemented, and MUST allow filtering any attribute defined in the schema, *except attributes marked as passthru*. +For attributes marked as *passthru*, the list endpoint MAY provide filtering capability, but the spec is silent on this behavior because passthru attributes may be of types other than string. #### 3.5 Attribute From e92a86b4600378dac2e43138beaaacfb1f43e18b Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 20 Nov 2024 12:27:23 -0500 Subject: [PATCH 04/12] some renaming --- docs/seqcol.md | 27 ++++++++++++++++++++------- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/docs/seqcol.md b/docs/seqcol.md index 0048f08..490c3cf 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -465,9 +465,11 @@ For more details about how to interpret the results of the comparison function t #### 3.4 List -- *Endpoint*: `GET /list/:object_type?page=:page&page_size=:page_size&:attribute1=:attribute_digest1&attribute2=:attribute_digest2` (`REQUIRED`) -- *Description*: Lists identifiers for a given object type in singular form (*e.g.* `/list/collection`). This endpoint provides a way to discover what sequence collections a service provides. Returned lists can be filtered to only objects with certain attribute values using query parameters. Page numbering begins at page 0 (the first page is page 0). -- *Return value*: The output is a paged list of identifiers following the GA4GH paging guide format, grouped into a `results` and a `pagination` section. If no `?:attribute=:attribute_value` query parameters are provided, the endpoint will return all items (paged). Adding one or more `:attribute` and `:attribute_digest` values as *query parameters* will filter results to only the collections with the given attribute digest. If multiple attributes are provided, the filter should require ALL of these attributes to match (so multiple attributes are treated with an `AND` operator). 
+- *Endpoint*: `GET /list/:object_type?page=:page&page_size=:page_size&:attribute1=:attribute1_level1_repr&attribute2=:attribute2_level1_repr` (`REQUIRED`) +- *Description*: Lists identifiers for a given object type in singular form (*e.g.* `/list/collection`). This endpoint provides a way to discover what sequence collections a service provides. + Returned lists can be filtered to only objects with certain attribute values using query parameters. + Page numbering begins at page 0 (the first page is page 0). +- *Return value*: The output is a paged list of identifiers following the GA4GH paging guide format, grouped into a `results` and a `pagination` section. If no `?:attribute=:attribute_level1_repr` query parameters are provided, the endpoint will return all items (paged). Adding one or more `:attribute` and `:attribute_digest` values as *query parameters* will filter results to only the collections with the given attribute digest. If multiple attributes are provided, the filter should require ALL of these attributes to match (so multiple attributes are treated with an `AND` operator). Example return value: @@ -485,8 +487,8 @@ Example return value: } ``` -The `list` endpoint MUST be implemented, and MUST allow filtering any attribute defined in the schema, *except attributes marked as passthru*. -For attributes marked as *passthru*, the list endpoint MAY provide filtering capability, but the spec is silent on this behavior because passthru attributes may be of types other than string. +The `list` endpoint MUST be implemented, and MUST allow filtering using any attribute defined in the schema, *except attributes marked as passthru*. +For attributes marked as *passthru*, the list endpoint MAY provide filtering capability. #### 3.5 Attribute @@ -508,8 +510,7 @@ Example `/attribute/collection/names/:digest` return value: ``` The attribute endpoint MUST be functional for any attribute defined in the schema, *except those marked as transient or passthru*. 
-The endpoint MAY respond to requests for attributes marked as *passthru*. -The endpoint SHOULD NOT respond to requests for attributes marked as *transient*. +The endpoint SHOULD NOT respond to requests for attributes marked as *passthru* OR *transient*. For more information on transient and passthru attributes, see [Section 4](#4-extending-the-schema-schema-attribute-qualifiers). @@ -613,6 +614,18 @@ In contrast, the idea of `collated` describes a property independently: Whether Finally, the 3 global qualiers are grouped under the 'ga4gh' key for consistency with other GA4GH specifications, and to group the seqcol-specific extended functionality into one place. + + +The qualifiers are all about transitions from different representations. + +qualifer | level1 | level2 +*NONE* | digest | main representation +passthru | yes | same as level1 +transient | yes | N/A (not present) +inherent + + + ### 5. Ancillary attribute management: recommended non-inherent attributes In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. From 9e5664deb653bbc1637b832ec0ebf6bb9304b675 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 20 Nov 2024 12:47:40 -0500 Subject: [PATCH 05/12] add endpoint behavior for qualified attributes --- docs/seqcol.md | 57 +++++++++++++++++++++++++++++++++----------------- 1 file changed, 38 insertions(+), 19 deletions(-) diff --git a/docs/seqcol.md b/docs/seqcol.md index 490c3cf..ac0c3b3 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -558,25 +558,54 @@ See: [ADR on 2023-03-22 regarding inherent attributes](decision_record.md#2023-0 #### 4.4 Passthru attributes -Passthru attributes are *not digested* in transition from level 2 to level 1. -In other words, the value of a passthru attribute is the same in the level 1 and level 2 representations +Passthru attributes have the same representation at level 1 and level 2. +In other words, they are *not digested* in transition from level 2 to level 1. 
This is not the case for most attributes; most attributes of the canonical (level 2) seqcol representation are digested to create the level 1 representation. But sometimes, we have an attribute for which digesting makes little sense. -These attributes are passed through the transformation, so they show up on the level 1 representation in the same form as the level 2 representation. +These attributes are passed through without transformation, so they show up on the level 1 representation in the same form as the level 2 representation. Thus, we refer to them as passthru attributes. +Here's how passthru attributes behave in the endpoints: +- `/list`: The server MAY allow filtering on passthru attributes, but this is not required. +- `/collection`: At both level 1 and level 2, the collection object includes the same passthru attribute representation. +- `/comparison`: Passthru attributes are listed in the 'attributes' section, but are not listed under 'array_elements'. +- `/attribute`: Passthru attributes cannot be used with the attribute endpoint, as the return would be the same as the query. + #### 4.5 Transient attributes -Transient attributes are those that *cannot be retrieved* through the `/attribute` endpoint. -Most attributes of the sequence collection can be retrieved through the `/attribute` endpoint. -However, some attributes may not be retrievable. -For example, this could happen for an attribute that we intend to be used primarily as an identifier. -In this case, we don't necessarily want to store the original content that went into the digest into the database, because it might be redundant or whatever. +A transient attribute is an attribute that only has a level 1 representation stored in the server. +Transient attributes therefore *cannot be retrieved* through the `/attribute` endpoint. +All other attributes of the sequence collection can be retrieved through the `/attribute` endpoint. 
+The transient qualifier would apply to an attribute that we intend to be used primarily as an identifier. +In this case, we don't necessarily want to store the original content that went into the digest into the database. We really just want the final attribute. These attributes are called transient because the content of the attribute is no longer stored and is therefore no longer retrievable. -#### 4.6 Method of specifying attribute qualifiers +Here's how transient attributes behave in the endpoints: +- `/list`: No change; a transient attribute level1 representation can be used to list sequence collections that contain it. +- `/collection`: For level 1 representation, no change; the collection object includes the transient attribute level 1 representation. For level 2 representation, there *is* a change; transient attributes have no level 2 representation on the server, so the sequence collection SHOULD leave this attribute out of the level 2 representation. +- `/comparison`: Transient attributes are listed in the 'attributes' section, but are not listed under 'array_elements' because there is no level 2 representation. +- `/attribute`: Transient attributes cannot be used with the attribute endpoint (there is no value to retrieve) + + +#### 4.6 Qualifier summary table + +The global qualifiers are all concerned with how the representations are treated when converting between different detail levels. +The *inherent* qualifier is related to the level 1 → 0 transition. +It is true if the level 1 representation is included during creation of level 0 representation. +Then the *passthru* and *transient* qualifiers are related to the level 2 → 1 transition. + + +Qualifier | Level1? | Level2? | Notes +--- | ----- | ----- | ----- +*none* | as normal | full content | Default state; the level 1 representation is a digest of the level 2 representation. +passthru | as normal | same as level1 | True if the level 2 representation is the same as the level 1 representation.
+transient | as normal | not present | True if the level 2 representation is not present. + + + +#### 4.7 Method of specifying attribute qualifiers In JSON Schema, there are 2 ways to qualify properties: 1) a local qualifier, using a key under a property; or 2) an object-level qualifier, which is specified with a keyed list of properties up one level. For example, you annotate a property's `type` with a local qualifier, underneath the property, like this: @@ -616,16 +645,6 @@ Finally, the 3 global qualiers are grouped under the 'ga4gh' key for consistency -The qualifiers are all about transitions from different representations. - -qualifer | level1 | level2 -*NONE* | digest | main representation -passthru | yes | same as level1 -transient | yes | N/A (not present) -inherent - - - ### 5. Ancillary attribute management: recommended non-inherent attributes In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. From cf90134692b4ef680483f8e8b55d7201ce6c0f42 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 20 Nov 2024 12:51:04 -0500 Subject: [PATCH 06/12] add qualifiers to custom attributes --- docs/seqcol.md | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/docs/seqcol.md b/docs/seqcol.md index ac0c3b3..9c1fc95 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -660,13 +660,19 @@ This attribute provides a way to look up the ordered coordinate system (the "chr It is created deterministically from the `names` and `lengths` attributes in the collection; it *does not* depend on the actual sequence content, so it is consistent across two collections with different sequence content if they have the same `names` and `lengths`, which are correctly collated. This attribute is `RECOMMENDED` to allow retrieval of the coordinate system for a given reference sequence collections. -Algorithm: +##### Algorithm 1. 
Lump together each name-length pair from the primary collated `names` and `lengths` into an object, like `{"length":123,"name":"chr1"}`. 2. Build a collated list, corresponing to the names and lengths of the object (*e.g.* `[{"length":123,"name":"chr1"},{"length":456,"name":"chr2"}],...`) 3. Add as a collated attribute to the sequence collection object. -The `name_length_pairs` attribute is *not inherent*, *not passthru*, and *not transient*. +##### Qualifiers + +- inherent: false +- collated: false +- passthru: false +- transient: false + #### 5.1 The `sorted_name_length_pairs` attribute (`RECOMMENDED`) @@ -677,7 +683,7 @@ but with pairs not necessarily in the same order. This attribute is `RECOMMENDED` to allow unified genome browser visualization of data defined on different reference sequence collections. For more rationale and use cases of `sorted_name_length_pairs`, see [*Footnote F7*](#f7-use-cases-for-the-sorted_name_length_pairs-non-inherent-attribute). -Algorithm: +##### Algorithm 1. Lump together each name-length pair from the primary collated `names` and `lengths` into an object, like `{"length":123,"name":"chr1"}`. 2. Canonicalize JSON according to the seqcol spec (using RFC-8785). @@ -685,7 +691,12 @@ Algorithm: 4. Sort the digests lexicographically. 5. Add to the sequence collection object. -The `sorted_name_length_pairs` attribute is: non-inherent, non-collated, non-passthru, and transient. +##### Qualifiers + +- inherent: false +- collated: false +- passthru: false +- transient: **true** #### 5.3 The `sorted_sequences` attribute (`OPTIONAL`) @@ -698,13 +709,18 @@ Simply that for some large-scale use cases, comparing the sequence content witho In these cases, using the comparison function could be computationally prohibitive. This digest allows the comparison to be pre-computed, and more easily compared. -Algorithm: +##### Algorithm 1. Take the array of the `sequences` attribute (an array of sequence digests) and sort it lexicographically. 2. 
Canonicalize the resulting array (using RFC-8785). 3. Add to the sequence collection object as the `sorted_sequences` attribute, which is non-inherent and non-collated. +##### Qualifiers +- inherent: false +- collated: false +- passthru: false +- transient: false --- From 81d3791b0dda2084967e7a60bebf4562ef755127 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 20 Nov 2024 12:52:31 -0500 Subject: [PATCH 07/12] name_length pairs is collated --- docs/seqcol.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/seqcol.md b/docs/seqcol.md index 9c1fc95..ea673bc 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -669,7 +669,7 @@ This attribute is `RECOMMENDED` to allow retrieval of the coordinate system for ##### Qualifiers - inherent: false -- collated: false +- collated: true - passthru: false - transient: false From 5e9d263fcc65dde29c60e6f2838a452b7738f8d6 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 20 Nov 2024 12:53:07 -0500 Subject: [PATCH 08/12] make qualifiers recommended --- docs/seqcol.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/seqcol.md b/docs/seqcol.md index ea673bc..43d01cb 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -666,7 +666,7 @@ This attribute is `RECOMMENDED` to allow retrieval of the coordinate system for 2. Build a collated list, corresponing to the names and lengths of the object (*e.g.* `[{"length":123,"name":"chr1"},{"length":456,"name":"chr2"}],...`) 3. Add as a collated attribute to the sequence collection object. -##### Qualifiers +##### Qualifiers (RECOMMENDED) - inherent: false - collated: true @@ -691,7 +691,7 @@ This attribute is `RECOMMENDED` to allow unified genome browser visualization of 4. Sort the digests lexicographically. 5. Add to the sequence collection object. -##### Qualifiers +##### Qualifiers (RECOMMENDED) - inherent: false - collated: false @@ -715,7 +715,7 @@ This digest allows the comparison to be pre-computed, and more easily compared. 2. 
Canonicalize the resulting array (using RFC-8785). 3. Add to the sequence collection object as the `sorted_sequences` attribute, which is non-inherent and non-collated. -##### Qualifiers +##### Qualifiers (RECOMMENDED) - inherent: false - collated: false From 20215caa96a17eecab8afeb7e22c92373fabebb6 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 20 Nov 2024 14:39:06 -0500 Subject: [PATCH 09/12] add decision on inherent schema location. See #84 --- docs/decision_record.md | 22 ++++++++++++ docs/seqcol.md | 73 ++++++++++++++-------------------------- docs/seqcol_rationale.md | 37 ++++++++++++++++++++ 3 files changed, 85 insertions(+), 47 deletions(-) diff --git a/docs/decision_record.md b/docs/decision_record.md index 1cbb184..b44d105 100644 --- a/docs/decision_record.md +++ b/docs/decision_record.md @@ -8,6 +8,23 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "S [TOC] +## 2024-11-20 Custom modifiers should live in the schema under the `ga4gh` key + +### Decision + +Any global custom modifiers should live under a `ga4gh` key in the schema. Right now, this includes `inherent`, `transient`, and `passthru`. +Local modifiers (currently just `collated`) will continue to live, raw, under the attribute they describe. + + +### Rationale + +We want to follow the standard used in the other specs (VRS), and it also seems fine to have a place to lump together our custom modifiers. +We thought we could also do this for `collated`, as a local modifier, but opt not to right now because: there's only 1, it's a boolean, and it's not actually even used for anything in the spec at the moment; it is only there because it could be nice to use for a visualization of elements in a collection. The additional complexity of another layer just for this seems pointless at this point. + +### Linked issues + +- + ## 2024-11-13 Attributes can be designed as `passthru` or `transient`.
### Decision @@ -19,6 +36,11 @@ Transient attributes - Transient attributes are not retrievable from the attribute endpoint. Most attributes of the sequence collection can be retrieved through the /attribute endpoint. However, some attributes may not be retrievable. For example, this could happen for an attribute that we intend to be used primarily as an identifier. In this case, we don't necessarily want to store the original content that went into the digest into the database, because it might be redundant. We really just want the final attribute. These attributes are called transient because the content of the attribute is no longer stored and is therefore no longer retrievable. +Also, a few other related decisions we finalized: +- `collection` endpoint, level 2 collection representation should exclude transient attributes. +- `attribute` endpoint wouldn't provide anything for either transient or passthru attributes. +- Can passthru or transient attributes be inherent? They could, but it probably doesn't really make sense. Nevertheless, there's no reason to state that they cannot be. + ### Rationale As we worked on more advanced attributes, and with the addition of the `/attribute` endpoint, we realized these changes necessitate a bit more power for the schema to specify behavior of the attributes. For the basic seqcol attributes (names, lengths, sequences) and original endpoint, the general algorithm and basic qualifiers (required, inherent, collated) suffice to describe the representation. But some more nuanced attributes require additional qualifiers to describe their intention and how the server should be behave for the `/attribute` endpoint. For example, sorted_name_length_pairs and sorted_sequences are intended to provide alternative tailored identifiers and comparisons, and not necessarily useful for independent attribute lookup. 
Similarly, custom extra attributes, like author or alias, may be simple appendages that don't need the complex digesting procedure we use for the basic attributes. In order to flag such attributes in a way that can govern slightly different server expectations, we need a couple of additional advanced attribute qualifiers. For this purpose, we added the passthru and transient qualifiers. diff --git a/docs/seqcol.md b/docs/seqcol.md index 43d01cb..a3f2e73 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -31,7 +31,7 @@ In brief, the project specifies several procedures: 1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol digests are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. 2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies an http API to retrieve the sequence collection given a digest. A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different references genomes are compatible. -3. 
**Recommended ancillary, non-inherent attributes.** Finally, the protocol defines several recommended procedures that will improve the compatibility across Seqcol servers, and beyond. +3. **Recommended ancillary attributes.** Finally, the protocol defines several recommended procedures that will improve the compatibility across Seqcol servers, and beyond. ## Use cases @@ -153,7 +153,7 @@ We RECOMMEND that all implementations use this as a base schema, adding addition We RECOMMEND *not changing the inherent attributes list*, because this will keep the identifiers compatible across implementations. Implementations that use different inherent attributes are still compliant with the specification generally, but do so at the cost of top-level digest interoperability. -For more information about community-driven updates to the standard schema, see [*Footnote F8*](#f8-adding-new-schema-attributes). +For more information about community-driven updates to the standard schema, see [*Footnote F5*](#f5-adding-new-schema-attributes). ### 2. Encoding: Computing sequence digests from sequence collections @@ -168,7 +168,10 @@ The steps of the encoding process are: - **Step 4**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) again to canonicalize the JSON of the new seqcol object representation. - **Step 5**. Digest the final canonical representation again using the GA4GH digest algorithm. -Example Python code for computing a seqcol digest can be found in the [tutorial for computing seqcol digests](digest_from_collection.md). These steps are described in more detail below: +Example Python code for computing a seqcol digest can be found in the [tutorial for computing seqcol digests](digest_from_collection.md). +For information about the possibility of deviating from this procedure for custom attributes, see [*Footnote F6*](#f6-custom-encoding-algorithms).
+ +These steps are described in more detail below: #### Step 1: Organize the sequence collection data into *canonical seqcol object representation*. @@ -239,12 +242,12 @@ b'["SQ.2648ae1bacce4ec4b6cf337dcae37816","SQ.907112d17fcb73bcab1ed1c72b97ce68"," _* The above Python function suffices if (1) attribute keys are restricted to ASCII, (2) there are no floating point values, and (3) for all integer values `i`: `-2**63 < i < 2**63`_ -Also, notice that in this process, RFC-8785 is applied only to objects; we assume the sequence digests are computed through an external process (the refget sequences protocol), and are not computed as part of the sequence collection. The refget sequences protocol digests sequence strings without JSON-canonicalization. For more details, see [*Footnote F5*](#f5-rfc-8785-does-not-apply-to-refget-sequences). +Also, notice that in this process, RFC-8785 is applied only to objects; we assume the sequence digests are computed through an external process (the refget sequences protocol), and are not computed as part of the sequence collection. The refget sequences protocol digests sequence strings without JSON-canonicalization. For more details, see [*Footnote F2*](#f2-rfc-8785-does-not-apply-to-refget-sequences). #### Step 3: Digest each canonicalized attribute value using the GA4GH digest algorithm. Apply the GA4GH digest algorithm to each attribute value. -The GA4GH digest algorithm is described in detail in [*Footnote F6*](#f6-the-ga4gh-digest-algorithm). +The GA4GH digest algorithm is described in detail in [*Footnote F3*](#f3-the-ga4gh-digest-algorithm). This converts the value of each attribute in the seqcol into a digest string. Applying this to each value will produce the following structure: @@ -645,7 +648,7 @@ Finally, the 3 global qualiers are grouped under the 'ga4gh' key for consistency -### 5. Ancillary attribute management: recommended non-inherent attributes +### 5. 
Recommended ancillary attributes In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. Non-inherent attributes provide a standardized way for implementations to store and serve additional, third-party attributes that do not contribute to the digest. @@ -658,7 +661,7 @@ Here, we specify standardized, useful non-inherent attributes that we recommend. The `name_length_pairs` attribute is a *non-inherent* attribute of a sequence collection with a formal definition, provided here. This attribute provides a way to look up the ordered coordinate system (the "chrom sizes") for a sequence collection. It is created deterministically from the `names` and `lengths` attributes in the collection; it *does not* depend on the actual sequence content, so it is consistent across two collections with different sequence content if they have the same `names` and `lengths`, which are correctly collated. -This attribute is `RECOMMENDED` to allow retrieval of the coordinate system for a given reference sequence collections. +This attribute is `RECOMMENDED` to allow retrieval of the coordinate system for a given reference sequence collection. ##### Algorithm @@ -681,7 +684,7 @@ When digested, this attribute provides a digest for an order-invariant coordinat Because it is *non-inherent*, it does not affect the identity (digest) of the collection. but with pairs not necessarily in the same order. -This attribute is `RECOMMENDED` to allow unified genome browser visualization of data defined on different reference sequence collections. For more rationale and use cases of `sorted_name_length_pairs`, see [*Footnote F7*](#f7-use-cases-for-the-sorted_name_length_pairs-non-inherent-attribute). +This attribute is `RECOMMENDED` to allow unified genome browser visualization of data defined on different reference sequence collections. 
For more rationale and use cases of `sorted_name_length_pairs`, see [*Footnote F4*](#f4-use-cases-for-the-sorted_name_length_pairs-non-inherent-attribute). ##### Algorithm @@ -740,45 +743,11 @@ While the latter is intuitive, as it captures each sequence object with some acc See [ADR on 2021-06-30 on array-oriented structure](decision_record.md#2021-06-30-use-array-based-data-structure-and-multi-tiered-digests) -### F4. Sequence collections without sequences - -Typically, we think of a sequence collection as consisting of real sequences, but in fact, sequence collections can also be used to specify collections where the actual sequence content is irrelevant. -Since this concept can be a bit abstract for those not familiar, we'll try here to explain the rationale and benefit of this. -First, consider that in a sequence comparison, for some use cases, we may be primarily concerned only with the *length* of the sequence, and not the actual sequence of characters. -For example, BED files provide start and end coordinates of genomic regions of interest, which are defined on a particular sequence. -On the surface, it seems that two genomic regions are only comparable if they are defined on the same sequence. -However, this not *strictly* true; in fact, really, as long as the underlying sequences are homologous, and the position in one sequence references an equivalent position in the other, then it makes sense to compare the coordinates. -In other words, even if the underlying sequences aren't *exactly* the same, as long as they represent something equivalent, then the coordinates can be compared. 
-A prerequisite for this is that the *lengths* of the sequence must match; it wouldn't make sense to compare position 5,673 on a sequence of length 8,000 against the same position on a sequence of length 9,000 because those positions don't clearly represent the same thing; but if the sequences have the same length and represent a homology statement, then it may be meaningful to compare the positions. - -We realized that we could gain a lot of power from the seqcol comparison function by comparing just the name and length vectors, which typically correspond to a coordinate system. -Thus, actual sequence content is optional for sequence collections. -We still think it's correct to refer to a sequence-content-less sequence collection as a "sequence collection" -- because it is still an abstract concept that *is* representing a collection of sequences: we know their names, and their lengths, we just don't care about the actual characters in the sequence in this case. -Thus, we can think of these as a sequence collection without sequence characters. - -An example of a canonical representation (level 2) of a sequence collection with unspecified sequences would be: - -``` -{ - "lengths": [ - "1216", - "970", - "1788" - ], - "names": [ - "A", - "B", - "C" - ] -} -``` - -### F5. RFC-8785 does not apply to refget sequences +### F2. RFC-8785 does not apply to refget sequences A note to clarify potential confusion with RFC-8785. While the sequence collection specification determines that RFC-8785 will be used to canonicalize the JSON before digesting, this is specific to sequence collections, it *does not apply to the original refget sequences protocol*. According to the sequences protocol, sequences are digested as un-quoted strings. If RFC-8785 were applied at the level of individual sequences, they would be quoted to become valid JSON, which would change the digest. 
Since the sequences protocol predated the sequence collections protocol, it did not use RFC-8785; and anyway, the sequences are just primitive types so a canonicalization scheme doesn't add anything. This leads to the slight confusion that RFC-8785 canonicalization is only applied to the objects in the sequence collections, and not to the primitives when the underlying sequences are digested. - -### F6. The GA4GH digest algorithm +### F3. The GA4GH digest algorithm The GA4GH digest algorithm, `sha512t24u`, was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This procedure is described as ([Hart _et al_. 2020](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883)): @@ -803,7 +772,7 @@ def sha512t24u_digest(seq: bytes) -> str: See: [ADR from 2023-01-25 on digest algorithm](decision_record.md#2023-01-25-digest-algorithm) -### F7. Use cases for the `sorted_name_length_pairs` non-inherent attribute +### F4. Use cases for the `sorted_name_length_pairs` non-inherent attribute One motivation for this attribute comes from genome browsers, which may display genomic loci of interest (*e.g.* BED files). The genome browser should only show BED files if they annotate the same coordinate system as the reference genome. @@ -820,7 +789,7 @@ Thus, in a production setting, the full compatibility check can be reduced to a See: [ADR from 2023-07-12 on sorted name-length pairs](decision_record.md#2023-07-12-implementations-should-provide-sorted_name_length_pairs-and-comparison-endpoint) -### F8. Adding new schema attributes +### F5. Adding new schema attributes A strength of the seqcol standard is that the schema definition can be modified for particular use cases, for example, by adding new attributes into a sequence collection. 
This will allow different communities to use the standard without necessarily needing to subscribe to identical schemas, allowing the standard to be more generally useful. @@ -832,4 +801,14 @@ The goal is not to include all possible attributes in the schema, but just those An implementation may propose a new attribute to be added to this extended schema by raising an issue on the GitHub repository. The proposed attributes and definition can then be approved through discussion during the refget working group calls and ultimately added to the approved extended seqcol schema. These GitHub issues should be created with the label 'schema-term'. -You can follow these issues (or raise your own) at . \ No newline at end of file +You can follow these issues (or raise your own) at . + +### F6. Custom encoding algorithms + +A core part of the Sequence Collections specification is the *encoding* algorithm, which describes how to create the digest for a sequence collection. +The encoding process can be divided into two steps: first, the attributes are encoded into the level 1 representation, and then this is encoded to produce the final digest (also called the level 0 or top-level representation). +The first step, encoding from level 2 to level 1, follows a default procedure, which is applied to any attribute whose definition does not specify something else. +This is the way all the minimal attributes (names, lengths, and sequences) should behave. +But custom attributes MAY diverge from this approach by defining their own encoding procedure, which specifies how the level 1 digest is computed from the level 2 representation. +For example, in the list of recommended ancillary attributes, `name_length_pairs` does not define a custom procedure for encoding, so it follows the default procedure. +A different custom attribute, though, MAY specify its own encoding procedure.
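
The split between default and custom level 2 to level 1 encoding described in the F6 footnote could be sketched as follows. This is only an illustration, not part of the specification: `default_encode`, `encode_level1`, and the `unordered_tags` attribute are hypothetical names, and the `json.dumps` call merely approximates RFC-8785 for simple values (ASCII keys, no floats).

```python
import base64
import hashlib
import json
from typing import Any, Callable, Dict


def default_encode(value: Any) -> str:
    """Default level 2 -> level 1 encoding: canonicalize, then digest."""
    # Approximation of RFC-8785 for simple values (assumption, see lead-in).
    canonical = json.dumps(
        value, separators=(",", ":"), sort_keys=True,
        ensure_ascii=False, allow_nan=False,
    ).encode("utf-8")
    # sha512t24u: SHA-512, truncated to 24 bytes, base64url-encoded.
    truncated = hashlib.sha512(canonical).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def encode_level1(
    level2: Dict[str, Any],
    custom_encoders: Dict[str, Callable[[Any], str]],
) -> Dict[str, str]:
    """Encode each attribute, using a custom encoder only where one is defined."""
    return {
        name: custom_encoders.get(name, default_encode)(value)
        for name, value in level2.items()
    }


# Hypothetical custom attribute whose level 1 digest ignores element order.
custom = {"unordered_tags": lambda tags: default_encode(sorted(tags))}

level1 = encode_level1(
    {"names": ["A", "B"], "unordered_tags": ["y", "x"]},
    custom,
)
```

Here `names` falls through to the default procedure, while the hypothetical `unordered_tags` attribute defines its own; any attribute without an entry in `custom_encoders` behaves like the minimal attributes.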
diff --git a/docs/seqcol_rationale.md b/docs/seqcol_rationale.md index f85d072..7797c4c 100644 --- a/docs/seqcol_rationale.md +++ b/docs/seqcol_rationale.md @@ -82,3 +82,40 @@ One final important point. Sometimes people seeing the standard for the first ti For reasons outlined in the specification, for the actual computing of the identifier, it's important to use the array-based structure -- this is what enables us to use the "level 1" digests for certain comparison questions, and also provides critical performance benefits for extremely large sequence collections. But don't let this dissuade you! My critical point is this: *the way to compute the interoperable identifier does not force you to structure your data in a certain way in your service* -- it's simply the way you structure the data when you compute its identifier. You are, of course, free to store a collection however you want, in whatever structure makes sense for you. You'd just need to structure it according to the standard if you wanted to implement the algorithm for computing the digest. In fact, my implementation provides a way to retrieve the collection information in either structure. + + + + + +### Sequence collections without sequences + +Typically, we think of a sequence collection as consisting of real sequences, but in fact, sequence collections can also be used to specify collections where the actual sequence content is irrelevant. +Since this concept can be a bit abstract for those not familiar, we'll try here to explain the rationale and benefit of this. +First, consider that in a sequence comparison, for some use cases, we may be primarily concerned only with the *length* of the sequence, and not the actual sequence of characters. +For example, BED files provide start and end coordinates of genomic regions of interest, which are defined on a particular sequence. +On the surface, it seems that two genomic regions are only comparable if they are defined on the same sequence. 
+However, this is not *strictly* true; in fact, as long as the underlying sequences are homologous, and the position in one sequence references an equivalent position in the other, then it makes sense to compare the coordinates. +In other words, even if the underlying sequences aren't *exactly* the same, as long as they represent something equivalent, then the coordinates can be compared. +A prerequisite for this is that the *lengths* of the sequences must match; it wouldn't make sense to compare position 5,673 on a sequence of length 8,000 against the same position on a sequence of length 9,000 because those positions don't clearly represent the same thing; but if the sequences have the same length and represent a homology statement, then it may be meaningful to compare the positions. + +We realized that we could gain a lot of power from the seqcol comparison function by comparing just the name and length vectors, which typically correspond to a coordinate system. +Thus, actual sequence content is optional for sequence collections. +We still think it's correct to refer to a sequence-content-less sequence collection as a "sequence collection" -- because it is still an abstract concept that *is* representing a collection of sequences: we know their names and their lengths; we just don't care about the actual characters in the sequence in this case. +Thus, we can think of these as a sequence collection without sequence characters.
+ +An example of a canonical representation (level 2) of a sequence collection with unspecified sequences would be: + +``` +{ + "lengths": [ + "1216", + "970", + "1788" + ], + "names": [ + "A", + "B", + "C" + ] +} +``` \ No newline at end of file From b3defda6126b9c3857e87e7369d77def5196d3ae Mon Sep 17 00:00:00 2001 From: nsheff Date: Tue, 10 Dec 2024 12:49:15 -0500 Subject: [PATCH 10/12] minor cleanup --- docs/README.md | 22 +++++++++++----------- docs/contributing.md | 2 +- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/README.md b/docs/README.md index 0d44da8..448a2d2 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,25 +1,25 @@ -# Refget - -Unique identifiers and lookup service for reference sequences and sequence collections. - -Refget abstract - +# Refget specifications ## What is refget? +Refget is a protocol for identifying and distributing reference biological sequences. +It currently consists of 2 standards: -Refget is a protocol for identifying and distributing biological sequence references. It currently consists of 2 standards: +1. [Refget sequences](sequences.md): a GA4GH-approved standard for individual sequences +2. [Refget sequence collections](seqcol.md): a standard for collections of sequences, under review + +Refget abstract -1. Refget sequences: a GA4GH-approved standard for individual sequences -2. Refget sequence collections: a standard for collections of sequences, under review ## What is the refget sequences standard? -The original refget handled sequences only. Refget enables access to reference sequences using an identifier derived from the sequence itself. +The original refget standard, now called *Refget sequences*, handles sequences only. +Refget sequences enables access to reference sequences using an identifier derived from the sequence itself. + ## What is the refget sequence collections standard? -*Sequence Collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. 
Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides: +*Refget sequence collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides: - implementations of an algorithm for computing sequence identifiers; - a lookup service to retrieve sequences given a seqcol identifier diff --git a/docs/contributing.md b/docs/contributing.md index aacfaa0..b833dcc 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -4,7 +4,7 @@ We welcome more participants! If you are interested in contributing, one of the ## Maintainers -- Nathan Sheffield, Center for Public Health Genomics, University of Virginia +- Nathan Sheffield, Department of Genome Sciences, University of Virginia - Andy Yates, EMBL-EBI - Timothee Cezard, EMBL-EBI From 08372d9a08ac5e93bb5436475a40956901d913f0 Mon Sep 17 00:00:00 2001 From: nsheff Date: Tue, 10 Dec 2024 16:45:08 -0500 Subject: [PATCH 11/12] add decision that level 2 excludes transient attrs --- docs/decision_record.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/decision_record.md b/docs/decision_record.md index b44d105..28014bb 100644 --- a/docs/decision_record.md +++ b/docs/decision_record.md @@ -8,6 +8,21 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "S [TOC] +## 2024-11-20 Level 2 return values should not return transient attributes + +### Decision + +Level 2 return values should not return transient attributes + +### Rationale + +We debated what the `/collection?level=2` endpoint should do with transient attributes, because the level 2 representations are not stored.
One train of thought was that it could return the level 1 representation; another was that it should include nothing. We decided that the purer approach would be to include neither. + +Another option was something like `?level=highest`, which would return level 2 representations for everything that has one, but level 1 representations for transient attributes. + +We decided that even if you don't have that information, you could just get it from the `?level=1` endpoint. Or, implementations could specify their own way. + + ## 2024-11-20 Custom modifiers should live in the schema under the `ga4gh` key ### Decision From f61021c074c87cc342f64fa27b2e795dfb5b0d16 Mon Sep 17 00:00:00 2001 From: nsheff Date: Wed, 11 Dec 2024 10:01:23 -0500 Subject: [PATCH 12/12] some clarifications --- docs/seqcol.md | 203 ++++++++++++++++++++++++++----------------------- mkdocs.yml | 1 + 2 files changed, 110 insertions(+), 94 deletions(-) diff --git a/docs/seqcol.md b/docs/seqcol.md index a3f2e73..eda99f2 100644 --- a/docs/seqcol.md +++ b/docs/seqcol.md @@ -25,42 +25,46 @@ Reference sequences are fundamental to genomic analysis. To make their analysis reproducible and efficient, we require tools that can identify, store, retrieve, and compare reference sequences. The primary goal of the *Sequence Collections* (seqcol) project is **to standardize identifiers for collections of sequences**. Seqcol can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. -A common example and primary use case of sequence collections is for reference genome, so this documentation sometimes refers to reference genomes for convenience; really, it can be applied to any collection of sequences. +A common example and primary use case of sequence collections is for a reference genome, so this documentation sometimes refers to reference genomes for convenience; really, it can be applied to any collection of sequences.
In brief, the project specifies several procedures: -1. **An algorithm for encoding sequence collection identifiers.** The GA4GH standard [refget sequences](http://samtools.github.io/hts-specs/refget.html) specifies a way to compute deterministic sequence identifiers from individual sequences. Seqcol uses refget sequence identifiers and adds functionality to wrap them into collections of sequences. Seqcol also handles sequence attributes, such as their names, lengths, or topologies. Seqcol digests are defined by a hash algorithm, rather than an accession authority, and are thus decentralized and usable for private sequence collections, cases without connection to a central database, or validation of sequence collection content and provenance. -2. **An API describing lookup and comparison of sequence collections.** Seqcol specifies an http API to retrieve the sequence collection given a digest. A main use case is to reproduce the exact sequence collection (*e.g.* reference genome) used for analysis, instead of guessing based on a human-readable identifier. Seqcol also provides a standardized method of comparing the contents of two sequence collections. This comparison function can *e.g.* be used to determine if analysis results based on different references genomes are compatible. -3. **Recommended ancillary attributes.** Finally, the protocol defines several recommended procedures that will improve the compatibility across Seqcol servers, and beyond. +1. **An algorithm for encoding sequence collection identifiers.** +Refget Sequence Collections extends [Refget Sequences](sequences.md) to collections of sequences. +Seqcol also handles sequence attributes, such as their names, lengths, or topologies. +Like Refget sequences, seqcol digests are defined by a hash algorithm, rather than an accession authority. +2. **An API describing lookup and comparison of sequence collections.** +Seqcol specifies an http API to retrieve a sequence collection given its digest. 
+This can be used to reproduce the exact sequence collection instead of guessing based on a human-readable identifier. +Seqcol also provides a standardized method of comparing the contents of two sequence collections. +3. **Recommended ancillary attributes.** +Finally, the protocol defines several recommended procedures that will improve compatibility across Seqcol servers, and beyond. ## Use cases -Sequence collections represent fundamental concepts; therefore the specification can be used for many use cases. -A primary goal is that that seqcol digests could replace or live alongside the human-readable identifiers currently used to identify reference genomes (*e.g.* "hg38" or "GRCh38"). -Reference genomes are an indispensable resource for genome analysis. -Such reference data is provided in many versions by various sources. -Unfortunately, this reference variation leads to fundamental problems in analysis of reference genomes: computational results are often irreproducible or incompatible because reference genome data they use is either not matching or unidentifiable. -These issues are partially caused by our tradition of simple human-readable reference identifiers; this is sub-optimal because such identifiers can refer to references with subtle (or not so subtle) differences, undermining the utility of the identifiers, as is well-known for "hg38" or "GRCh38" monikers. -One solution is to use unique identifiers that unambiguously identify a particular assembly, such as those provided by the NCBI Assembly database; however, this approach relies on a central authority, and therefore can not apply to custom genomes. -Another weakness of centralized unique identifiers is that they are insufficient to *confirm* identity, which must also consider the content of the genome. -A related problem is determining compatibility among reference genomes. 
-Analytical results based on different genome references may still be integrable, as long as certain conditions about those references are met. -However, there are no existing tools or standards to formalize and simplify answering the question of reference genome compatibility. - -An earlier standard, the refget sequences protocol, partially addressed this issue for individual sequences, such as a single chromosome, but is not directly applicable to collections of sequences, such as a linear reference genome. -Building on refget sequences, sequence collections presents fundamental concepts, and therefore the specification can be used for many use cases. -For example, we envision that seqcol identifiers could replace or live alongside the human-readable identifiers currently used to identify reference genomes (e.g. "hg38" or "GRCh38"), which would provide improved reproducibility. - -Some other examples of common use cases where the use of seqcol is beneficial include: - -- As a user, I wish to know what sequences are inside a specific collection, so that I can further access those sequences -- As a user, I want to compare the two sequence collections used by two separate analyses so I can understand how comparable and compatible their resulting data are. -- As a user, I am interested in a genome sequence collection but want to extract those sequences which compose the chromosomes/karyotype of a genome -- As a submission system, I want to know what exactly a sequence collection contains so I can validate a data file submission. -- As a software developer, I want to embed a sequence collection digest in my tool's output so that downstream tools can identify the exact sequence collection that was used -- I have a chromosome sizes file (a set of lengths and names), and I want to ask whether a given sequence collection is length-compatible with and/or name-compatible with this chromosome sizes file. 
-- As a genome browser, I have one sequence collection that the coordinate system displayed, and I want to know if a digest representing the coordinate system of a given BED file is compatible with the genome browser. -- As a data processor, my input data didn't include information about the reference genome used, and I want to generate the sequence collection digest and attach it so that further processing can benefit from the sequence collection features. +Sequence collections represent fundamental concepts, making the specification adaptable to a wide range of use cases. +A primary goal is to enable sequence collection (seqcol) digests to replace or complement the human-readable identifiers currently used for reference genomes (e.g., "hg38" or "GRCh38"). +Unfortunately, these simple identifiers often refer to references with subtle (or not so subtle) differences. Such variation leads to fundamental issues in analyses relying on reference genomes, undermining the utility of these identifiers. + +Unique identifiers, such as those provided by the NCBI Assembly database, partially address this problem by unambiguously identifying specific assemblies. However, this approach has limitations: + +- It depends on a central authority, which excludes custom genomes and doesn't cover all reference providers. +- Centralized identifiers alone cannot *confirm* identity, as identity also depends on the genome's content. +- It does not address the related challenge of determining compatibility among reference genomes. Analytical results or annotations based on different references may still be integrable if certain conditions are met, but current tools and standards lack the means to formalize and simplify compatibility comparisons. + +The refget sequences protocol provides a partial solution applicable to individual sequences, such as a single chromosome. +However, refget does not directly address collections of sequences, such as a linear reference genome. 
+Building on refget, the sequence collections specification introduces foundational concepts that support diverse use cases, including: + +- **Accessing sequences**: *As a data analyst, I want to know which sequences are in a specific collection so I can analyze them further.* +- **Comparing collections**: *As a data analyst, I want to compare the sequence collections used in two separate analyses to assess the compatibility of their resulting data.* +- **Annotation curation**: *As a data curator for SNP data, I want an unambiguous reference genome identifier upon which my SNP annotations can be interpreted, so I can compare them with confidence*. +- **Extracting subsets**: *As a data analyst, I want to extract specific sequences, such as those composing the chromosomes or karyotype of a genome.* +- **Validating submissions**: *As a submission system, I need to determine the exact content of a sequence collection to validate data file submissions.* +- **Embedding identifiers**: *As a software developer, I want to embed a sequence collection identifier in my tool's output, allowing downstream tools to identify the exact sequence collection used.* +- **Checking compatibility**: *As a data analyst using published data, I have a chromosome sizes file (a set of lengths and names) and want to determine whether a given sequence collection is length- or name-compatible with this file.* +- **Genome browser integration**: *As a genome browser, I use one sequence collection for the displayed coordinate system and want to check if a digest representing a given BED file's coordinate system is compatible with it.* +- **Annotating unknown references**: *As a data processor, I encounter input data without reference genome information and want to generate a sequence collection digest to attach, enabling further processing with seqcol features.* ## Definitions of key terms @@ -71,22 +75,25 @@ Some other examples of common use cases where the use of seqcol is beneficial in - 
**Coordinate system**: An ordered list of named sequence lengths, but without actual sequences. - **Digest**: A string resulting from a cryptographic hash function, such as `MD5` or `SHA512`, on input data. - **Length**: The number of characters in a sequence. +- **Level**: A way of specifying the completeness of a sequence collection representation. Level 0 is the simplest representation, level 1 more complete, level 2 even more complete, and so forth. Representation levels are described in detail under [terminology](#terminology). - **Qualifier**: A reserved term used in the schema to indicate a quality of an attribute, such as whether it is required, collated, or inherent. Qualifiers are listed below. - **Seqcol algorithm**: The set of instructions used to compute a digest from a sequence collection. - **Seqcol API**: The set of endpoints defined in the *retrieval* and *comparison* components of the seqcol protocol. - **Seqcol digest**: A digest for a sequence collection, computed according to the seqcol algorithm. -- **Seqcol protocol**: Collectively, the 3 operations outlined in this document, which include: 1. encoding of sequence collections; 2. API describing retrieval and comparison ; and 3. specifications for ancillary recommended attributes. +- **Seqcol protocol**: Collectively, the operations outlined in this document, which include: 1. encoding of sequence collections; 2. API describing retrieval and comparison ; and 3. specifications for ancillary recommended attributes. - **Sequence**: Seqcol uses refget sequences to identify actual sequences, so we generally use the term "sequence" in the same way. Refget sequences was designed for nucleotide sequences; however, other sequences could be provided via the same mechanism, *e.g.*, cDNA, CDS, mRNA or proteins. Essentially any ordered list of refget-sequences-valid characters qualifies. 
Sequence collections also goes further, since sequence collections may contain sequences of non-specified characters, which therefore have a length but no actual sequence content. - **Sequence digest** or **refget sequence digest**: A digest for a sequence, computed according to the refget sequence protocol. -- **Sequence collection**: A representation of 1 or more sequences that is structured according to the sequence collection schema +- **Sequence collection**: A representation of 1 or more sequences that is structured according to the sequence collection schema. - **Sequence collection attribute**: A property or feature of a sequence collection (*e.g.* names, lengths, sequences, or topologies). ### Attribute qualifiers -- **Collated**: A qualifier applied to a seqcol attribute indicating that the values of the attribute match 1-to-1 with the sequences in the collection and are represented in the same order. -- **Inherent**: A qualifier applied to a seqcol attribute indicating that the attribute is part of the definition of the sequence collection and therefore contributes to its digest. -- **Passthru**: A qualifier applied to a seqcol attribute indicating that the attribute is *not digested* in transition from level 2 to level 1. So its value on level 1 representation the same as the level 2 representation. -- **Transient**: A qualifier applied to a seqcol attribute indicating that the attribute *cannot be retrieved through the `/attribute` endpoint*. +These qualifiers apply to a seqcol attribute. These definitions specify something about the attribute if the qualifier is true: + +- **Collated**: the values of the attribute match 1-to-1 with the sequences in the collection and are represented in the same order. +- **Inherent**: the attribute is part of the definition of the sequence collection and therefore contributes to its digest. +- **Passthru**: the attribute is *not digested* in transition from level 2 to level 1. 
So its value at level 1 is the same as at level 2. +- **Transient**: the attribute *cannot be retrieved through the `/attribute` endpoint*. ## Seqcol protocol functionality @@ -103,11 +110,15 @@ The seqcol protocol defines the following: ### 1. Schema: Defining the attributes in the collection -The first step for a Sequence Collections implementation is to define the *list of contents*, that is, what attributes are allowed in each collection, and which of these affect the digest. +The first step for a Sequence Collections implementation is to define the *list of contents*, that is, what attributes are allowed in a collection, and which of these affect the digest. The sequence collections standard is flexible with respect to the schema used, so implementations of the standard can use the standard with different schemas, as required by a particular use case. This divides the choice of content from the choice of algorithm, allowing the algorithm to be consistent even in situations where the content is not. +Nevertheless, we RECOMMEND that all implementations start from the same base schema, and add additional attributes as needed, which will not break interoperability. +We RECOMMEND *not changing the inherent attributes list*, because this will keep the identifiers compatible across implementations. +Implementations that use different inherent attributes are still compliant with the specification generally, but do so at the cost of top-level digest interoperability. +For more information about community-driven updates to the base schema, see [*Footnote F5*](#f5-adding-new-schema-attributes). -This is an example of a general, minimal schema: +This is the RECOMMENDED minimal base schema: ```YAML description: "A collection of biological sequences." @@ -147,20 +158,23 @@ ga4gh: - sequences ``` -This example schema is the minimal standard schema. 
-Sequence collection objects that follow this basic minimal structure are said to be the *canonical seqcol object representation*. -We RECOMMEND that all implementations use this as a base schema, adding additional attributes as needed, which will not break interoperability. -We RECOMMEND *not changing the inherent attributes list*, because this will keep the identifiers compatible across implementations. -Implementations that use different inherent attributes are still compliant with the specification generally, but do so at the cost of top-level digest interoperability. +Sequence collection objects that follow the base minimal structure are said to be the *canonical seqcol object representation*. +The implementation `MUST` define its structure in a JSON Schema, such as this example. +Implementations `MAY` choose to extend this schema by adding additional attributes. +Implementations `MAY` also use a different schema, but we `RECOMMEND` the schema extend the base schema defined above. -For more information about community-driven updates to the standard schema, see [*Footnote F5*](#f5-adding-new-schema-attributes). +This schema extends vanilla JSON Schema with a few Seqcol-specific *attribute qualifiers*: the `collated` and `inherent` qualifiers. +The specification also defines other attribute qualifiers that are not used in the base schema. +For further details about attribute qualifiers, see [*Section 4*](#4-extending-the-schema-schema-attribute-qualifiers). ### 2. Encoding: Computing sequence digests from sequence collections -The encoding function is the algorithm that produces a unique digest for the sequence collection. +The encoding function is the algorithm that produces a unique digest for a sequence collection. The input to the function is a set of annotated sequences. This function is generally expected to be provided by local software that operates on a local set of sequences. 
-The steps of the encoding process are: +Example Python code for computing a seqcol digest can be found in the [tutorial for computing seqcol digests](digest_from_collection.md). +For information about the possibility of deviating from this procedure for custom attributes, see [*Footnote F6*](#f6-custom-encoding-algorithms). +The steps of the encoding process are described in detail below; briefly, the steps are: - **Step 1**. Organize the sequence collection data into *canonical seqcol object representation* and filter the non-inherent attributes. - **Step 2**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) (JCS) to canonicalize the value associated with each attribute individually. @@ -168,10 +182,7 @@ The steps of the encoding process are: - **Step 4**. Apply [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785) again to canonicalize the JSON of the new seqcol object representation. - **Step 5**. Digest the final canonical representation again using the GA4GH digest algorithm. -Example Python code for computing a seqcol digest can be found in the [tutorial for computing seqcol digests](digest_from_collection.md). -For information about the possibilty of deviating from this procedure for custom attributes, see [*Footnote F6*](#f6-custom-encoding-algorithms). -These steps are described in more detail below: #### Step 1: Organize the sequence collection data into *canonical seqcol object representation*. @@ -204,20 +215,15 @@ Here's an example of a sequence collection organized into the canonical seqcol o This object would validate against the JSON Schema above. The object is a series of arrays with matching length (`3`), with the corresponding entries collated such that the first element of each array corresponds to the first element of each other array.
For the rationale why this structure was chosen instead of an array of annotated sequences, see [*Footnote F1*](#f1-why-use-an-array-oriented-structure-instead-of-a-sequence-oriented-structure). -The implementation `MUST` define its structure in a JSON Schema, such as the example schema defined in step 1. -Implementations `MAY` choose to extend this schema by adding additional attributes. -Implementations `MAY` also use a different schema, but we `RECOMMEND` the schema extend the base schema defined above. -This schema extends vanilla JSON Schema in two ways; first, it provides the `collated` qualifier. -Second, it specifies the `inherent` qualifier. -For further details about attribute qualifiers, see [*Section 4*](#4-extending-the-schema-schema-attribute-qualifiers). -##### Filter non-inherent attributes -The `inherent` section in the seqcol schema is an extension of the basic JSON Schema format that adds specific functionality. -Inherent attributes are those that contribute to the digest; *non-inherent* attributes are not considered when computing the top-level digest. -Attributes of a seqcol that are *not* listed as `inherent` `MUST NOT` contribute to the digest; they are therefore excluded from the digest calculation. -Therefore, if the canonical seqcol representation includes any non-inherent attributes, these must be removed before proceeding to step 2. -In the simple example, there are no non-inherent attributes. +!!! warning "Filter non-inherent attributes" + + The `inherent` section in the seqcol schema is an extension of the basic JSON Schema format that adds specific functionality. + Inherent attributes are those that contribute to the digest; *non-inherent* attributes are not considered when computing the top-level digest. + Attributes of a seqcol that are *not* listed as `inherent` `MUST NOT` contribute to the digest; they are therefore excluded from the digest calculation. 
+ Therefore, if the canonical seqcol representation includes any non-inherent attributes, these must be removed before proceeding to step 2. + In the simple example, there are no non-inherent attributes. #### Step 2: Apply RFC-8785 to canonicalize the value associated with each attribute individually. @@ -242,7 +248,9 @@ b'["SQ.2648ae1bacce4ec4b6cf337dcae37816","SQ.907112d17fcb73bcab1ed1c72b97ce68"," _* The above Python function suffices if (1) attribute keys are restricted to ASCII, (2) there are no floating point values, and (3) for all integer values `i`: `-2**63 < i < 2**63`_ -Also, notice that in this process, RFC-8785 is applied only to objects; we assume the sequence digests are computed through an external process (the refget sequences protocol), and are not computed as part of the sequence collection. The refget sequences protocol digests sequence strings without JSON-canonicalization. For more details, see [*Footnote F2*](#f2-rfc-8785-does-not-apply-to-refget-sequences). +In this process, RFC-8785 is applied only to objects; we assume the sequence digests are computed through an external process (the refget sequences protocol), and are not computed as part of the sequence collection. +The refget sequences protocol digests sequence strings without JSON-canonicalization. +For more details, see [*Footnote F2*](#f2-rfc-8785-does-not-apply-to-refget-sequences). #### Step 3: Digest each canonicalized attribute value using the GA4GH digest algorithm. @@ -259,6 +267,13 @@ Applying this to each value will produce the following structure: } ``` +!!! warning "Exception for passthru attributes" + + This digesting procedure (Step 3) is applied by default to all attributes of a sequence collection, *except for attributes qualified in the schema as passthru*; these attributes are NOT digested in this way. 
+ Typically, passthru attributes would also not be inherent, and are therefore filtered out before this step anyway; in the rare case of an inherent passthru attribute, this digesting step is skipped. + For more information about passthru attributes, see [Section 4](#4-extending-the-schema-schema-attribute-qualifiers). + + #### Step 4: Apply RFC-8785 again to canonicalize the JSON of the new seqcol object representation. Here, we repeat step 2, except instead of applying RFC-8785 to each value separately, we apply it to the entire object. @@ -277,17 +292,11 @@ The result is the final unique digest for this sequence collection: wqet7IWbw2j2lmGuoKCaFlYS_R7szczz ``` - -#### Exception for passthru attributes - -The above canonicalization/digesting procedure is applied by default to all attributes of a sequence collection; however, there can be some exceptions. -Any attribute qualified in the schema as a *passthru* attribute is NOT digested in this way. - #### Terminology -Because the encoding algorithm is recursive, this leads to a few different ways to represent a sequence collection. We refer to these representations in "levels". The level number represents the number of "lookups" you'd have to do from the "top level" digest. So, we have: +The recursive encoding algorithm leads to several ways to represent a sequence collection. We refer to these representations as "levels". The level number represents the number of "lookups" you'd have to do from the "top level" digest. So, we have: -##### Level 0 +##### Level 0 (top-level digest) Just a plain digest, also known as the "top-level digest". This corresponds to **0 database lookups**. Example: ``` @@ -296,8 +305,9 @@ a6748aa0f6a1e165f871dbed5e54ba62 ##### Level 1 -What you'd get when you look up the digest with **1 database lookup** (no recursion). We sometimes refer to this as the "array digests" or "attribute digests", because it is made up a digest for each attribute in the sequence collection.
Example: -``` +What you'd get when you look up the digest with **1 database lookup**. We sometimes refer to this as the "array digests" or "attribute digests", because it is made up of a digest for each attribute in the sequence collection. Example: + +```json { "lengths": "4925cdbd780a71e332d13145141863c1", "names": "ce04be1226e56f48da55b6c130d45b94", @@ -307,9 +317,9 @@ What you'd get when you look up the digest with **1 database lookup** (no recurs ##### Level 2 -What you'd get with **2 database lookups** (1 recursive call). This is the most common representation, and hence, it the level of the *canonical seqcol representation*. Example: +What you'd get with **2 database lookups**. This is the most common representation, and hence, it is the level of the *canonical seqcol representation*. Example: -``` +```json { "lengths": [ "1216", @@ -334,10 +344,10 @@ What you'd get with **2 database lookups** (1 recursive call). This is the most The API has these top-level endpoints: -1. `/service-info`, for describing information about the service; -2. `/collection`, for retrieving sequence collections; and +1. `/service-info`, for describing information about the service. +2. `/collection`, for retrieving sequence collections. 3. `/comparison`, for comparing two sequence collections. -4. `/list`, for retriving a list of objects; and +4. `/list`, for retrieving a list of objects. 5. `/attribute`, for retriving the value of a specific attribute. In addition, a RECOMMENDED endpoint at `/openapi.json` SHOULD provide OpenAPI documentation. @@ -353,6 +363,7 @@ Under these umbrella endpoints are a few more specific sub-endpoints, described The `/service-info` endpoint should follow the [GA4GH-wide specification for service info](https://github.com/ga4gh-discovery/ga4gh-service-info/) for general description of the service. Then, it should also add a few specific pieces of information under a `seqcol` property: + - `schema`: MUST return the JSON Schema implemented by the server.
##### The service-info JSON-schema document @@ -361,7 +372,7 @@ The `schema` attribute of `service-info` return value MUST provide a single sche We RECOMMEND the schema only define terms actually used in at least one collection served; however, it is allowed for the schema to contain extra terms that are not used in any collections in the server. -We RECOMMEND your schema use property-level refs to point to terms defined by a central, approved seqcol schema. However, it is also allowed for the schema to embed all definitions locally. The central, approved seqcol schema will be made available as the spec is finalized. +We RECOMMEND your schema use property-level refs to point to terms defined by the minimal base seqcol schema. However, it is also allowed for the schema to embed all definitions locally. The base seqcol schema will be made available as the spec is finalized. For example, here's a JSON schema that uses a `ref` to reference the approved seqcol schema: @@ -378,12 +389,11 @@ properties: "$ref": "/sequences" required: - names - - lengths - sequences ga4gh: inherent: - - sequences - names + - sequences ``` @@ -407,7 +417,7 @@ Non-inherent attributes `MUST` be stored and returned by the collection endpoint Example `/comparison` return value: -``` +```json { "digests": { "a": "514c871928a74885ce981faa61ccbb1a", @@ -426,13 +436,13 @@ Example `/comparison` return value: "a_count": { "lengths": 195, "names": 195, - "sequences: 195 + "sequences": 195 }, "b_count": { "lengths": 25, "names": 25, - "sequences: 25 - } + "sequences": 25 + }, "a_and_b_count": { "lengths": 25, "names": 25, @@ -477,7 +487,7 @@ For more details about how to interpret the results of the comparison function t Example return value: -``` +```json { "results": [ ... @@ -517,9 +527,9 @@ The endpoint SHOULD NOT respond to requests for attributes marked as *passthru* For more information on transient and passthru attributes, see [Section 4](#4-extending-the-schema-schema-attribute-qualifiers). 
-##### Definition of `object_type` +!!! note "Definition of `object_type`" -The `/list` and `/attribute` endpoints both use an `:object_type` path parameter. The `object_type` should always be the *singular* descriptor of objects provided by the server. In this version of the Sequence Collections specification, the `object_type` is always `collection`; so the only allowable endpoints would be `/list/collection` and `/attribute/collection/:attribute_name/:digest`. We call this `object_type` because future versions of the specification may allow retrieving lists or attributes of other types of objects. + The `/list` and `/attribute` endpoints both use an `:object_type` path parameter. The `object_type` should always be the *singular* descriptor of objects provided by the server. In this version of the Sequence Collections specification, the `object_type` is always `collection`; so the only allowable endpoints would be `/list/collection` and `/attribute/collection/:attribute_name/:digest`. We call this `object_type` because future versions of the specification may allow retrieving lists or attributes of other types of objects. #### 3.6 OpenAPI documentation @@ -569,6 +579,7 @@ These attributes are passed through without transformation, so they show up on t Thus, we refer to them as passthru attributes. Here's how passthru attributes behave in the endpoints: + - `/list`: The server MAY allow filtering on passthru attributes, but this is not required. - `/collection`: At both level 1 and level 2, the collection object includes the same passthru attribute representation. - `/comparison`: Passthru attributes are listed in the 'attributes' section, but are not listed under 'array_elements'. @@ -586,6 +597,7 @@ We really just want the final attribute. These attributes are called transient because the content of the attribute is no longer stored and is therefore no longer retrievable. 
Here's how transient attributes behave in the endpoints: + - `/list`: No change; a transient attribute level1 representation can be used to list sequence collections that contain it. - `/collection`: For level 1 representation, no change; the collection object includes the transient attribute level 1 representation. For level 2 representation, there *is* a change; transient attributes have no level 2 representation on the server, so the sequence collection SHOULD leave this attribute out of the level 2 representation. - `/comparison`: Transient attributes are listed in the 'attributes' section, but are not listed under 'array_elements' because there is no level 2 representation. @@ -648,13 +660,16 @@ Finally, the 3 global qualiers are grouped under the 'ga4gh' key for consistency -### 5. Recommended ancillary attributes +### 5. Custom and recommended ancillary attributes + +The base schema described in Section 1 is a minimal starting point, which can be extended with additional custom attributes as needed. +As stated in Section 1, we RECOMMEND that schemas *extend* the base schema with any custom attributes. +Furthermore, we RECOMMEND the extended schema add only non-inherent attributes, so that the top-level digests remain compatible. +Here, we specify several standard non-inherent attributes that we RECOMMEND also be included in the schema. + +In addition, some attributes do not need to follow the typical encoding process. +Custom attributes can be defined, and they may also specify their own encoding process; see [*Footnote F6*](#f6-custom-encoding-algorithms). -In *Section 1: Encoding*, we distinguished between *inherent* and *non-inherent* attributes. -Non-inherent attributes provide a standardized way for implementations to store and serve additional, third-party attributes that do not contribute to the digest. -As long as separate implementations keep such information in non-inherent attributes, the digests will remain compatible.
-Furthermore, the structure for how such non-inherent metadata is retrieved will be standardized. -Here, we specify standardized, useful non-inherent attributes that we recommend. #### 5.1 The `name_length_pairs` attribute (`RECOMMENDED`) @@ -745,7 +760,7 @@ See [ADR on 2021-06-30 on array-oriented structure](decision_record.md#2021-06-3 ### F2. RFC-8785 does not apply to refget sequences -A note to clarify potential confusion with RFC-8785. While the sequence collection specification determines that RFC-8785 will be used to canonicalize the JSON before digesting, this is specific to sequence collections, it *does not apply to the original refget sequences protocol*. According to the sequences protocol, sequences are digested as un-quoted strings. If RFC-8785 were applied at the level of individual sequences, they would be quoted to become valid JSON, which would change the digest. Since the sequences protocol predated the sequence collections protocol, it did not use RFC-8785; and anyway, the sequences are just primitive types so a canonicalization scheme doesn't add anything. This leads to the slight confusion that RFC-8785 canonicalization is only applied to the objects in the sequence collections, and not to the primitives when the underlying sequences are digested. +A note to clarify potential confusion with RFC-8785. While the sequence collection specification determines that RFC-8785 will be used to canonicalize the JSON before digesting, this is specific to sequence collections, it *does not apply to the original refget sequences protocol*. According to the sequences protocol, sequences are digested as un-quoted strings. If RFC-8785 were applied at the level of individual sequences, they would be quoted to become valid JSON, which would change the digest. Since the sequences protocol predated the sequence collections protocol, it did not use RFC-8785; and anyway, the sequences are just primitive types so a canonicalization scheme doesn't add anything. 
This leads to the slight confusion that RFC-8785 canonicalization is only applied to the objects in the sequence collections, and not to the primitives when the underlying sequences are digested. In other words, from the perspective of Sequence Collections, we just take the sequence digests at face value, as handled by a third party; their content is not digested as part of the collections algorithm at a deeper level. ### F3. The GA4GH digest algorithm diff --git a/mkdocs.yml b/mkdocs.yml index 96d88c7..493d2c2 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -54,6 +54,7 @@ extra_css: - stylesheets/extra.css markdown_extensions: + - admonition - pymdownx.highlight: use_pygments: true - pymdownx.superfences: