Sequence characteristic metadata #486

ahwagner (Maintainer) · 2023-07-13T15:29:08Z

Referencing sequences via GA4GH SQ.<digest> identifiers has been a feature of VRS since 1.0. However, these digests (based on the RefGet spec) carry no information about the sequence alphabet or the molecule topology (circular or linear), which makes it challenging to implement sequence operations across different molecule types. Some initial work is planned in biocommons/biocommons.seqrepo#113 to implement and use these metadata for sequence operations. In VRS, we will need to consider how to capture these metadata as a separate structure, since this issue will not be addressed in the upcoming RefGet v2.
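To illustrate why the digest alone cannot carry this information, here is a minimal Python sketch of the GA4GH truncated digest (sha512t24u: SHA-512, truncated to 24 bytes, base64url-encoded). The helper names and the `records` list are hypothetical and are not taken from any VRS or RefGet library; the point is only that the identifier is a function of the characters and nothing else.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 chars)."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sq_identifier(sequence: str) -> str:
    """Hypothetical helper: build an SQ.<digest> identifier from the residues."""
    return "SQ." + sha512t24u(sequence.encode("ascii"))


# Two hypothetical records with the same characters but different metadata.
# Only the "sequence" value is hashed, so alphabet and topology are lost.
records = [
    {"sequence": "ATG", "alphabet": "nucleic acid", "circular": False},
    {"sequence": "ATG", "alphabet": "amino acid", "circular": False},
]
identifiers = [sq_identifier(r["sequence"]) for r in records]
print(identifiers[0] == identifiers[1])  # True
```

Any alphabet or circularity annotation therefore has to live outside the SQ. identifier, which is the motivation for the wrapper class discussed next.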

An initial implementation proposal is to create an annotation wrapper layer as a separate class that could then be referenced by parent objects such as SequenceLocation. We have drafted such a class as an example here:

vrs/schema/vrs-source.yaml, lines 422 to 447 in c693ac3:

    SequenceReference:
      maturity: Alpha
      inherits: ValueObject
      ga4ghDigest:
        prefix: SQR
        keys:
          - refgetAccession
      type: object
      description: A sequence of nucleic or amino acid character codes.
      properties:
        type:
          type: string
          const: "SequenceReference"
          default: "SequenceReference"
          description: MUST be "SequenceReference"
        refgetAccession:
          description: A `GA4GH RefGet <http://samtools.github.io/hts-specs/refget.html>` identifier for the referenced sequence, using the sha512t24u digest.
          type: string
          pattern: 'SQ.[0-9A-Za-z_\-]{32}'
        residueAlphabet:
          type: string
          enum:
            - amino acid
            - nucleic acid
      required:
        - refgetAccession

Note this requires creating a separate type prefix (SQR., for Sequence Reference), as this class references a molecular sequence but is distinct from the sequence as defined by RefGet (with reserved SQ. namespace). A drawback of this approach is an additional layer of indirection that implementations will need to support to get back to RefGet accessions.
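For readers who prefer code to schema, the draft class above could be mirrored roughly as follows. This is only a sketch in Python: the dataclass, the constant names, and the placeholder accession are assumptions made for illustration, and the full-string match is stricter than the unanchored `pattern` in the YAML.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern and enum values copied from the draft schema; names are hypothetical.
REFGET_PATTERN = re.compile(r"SQ.[0-9A-Za-z_\-]{32}")
RESIDUE_ALPHABETS = {"amino acid", "nucleic acid"}


@dataclass(frozen=True)
class SequenceReference:
    refgetAccession: str                    # required by the schema
    residueAlphabet: Optional[str] = None   # optional annotation
    type: str = "SequenceReference"

    def __post_init__(self) -> None:
        # Assumption: require the whole string to match, not just a substring.
        if not REFGET_PATTERN.fullmatch(self.refgetAccession):
            raise ValueError("refgetAccession must look like SQ.<32 chars>")
        if (self.residueAlphabet is not None
                and self.residueAlphabet not in RESIDUE_ALPHABETS):
            raise ValueError("residueAlphabet must be one of %s"
                             % sorted(RESIDUE_ALPHABETS))
        if self.type != "SequenceReference":
            raise ValueError('type MUST be "SequenceReference"')


# Placeholder accession for illustration only; not a real sequence digest.
ref = SequenceReference("SQ." + "0" * 32, residueAlphabet="nucleic acid")
```

A parent object such as SequenceLocation would then hold one of these instead of a bare SQ. string, which is the extra layer of indirection mentioned above.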

Opening this for discussion and progress on this topic for VRS 2.0.

ahwagner (Maintainer, Author) · 2023-10-05T16:17:38Z

This issue was raised during plenary. Our current 2.0-alpha model does away with a computed identifier for this class (no SQR.). In this new paradigm, we allow the digest from the RefGet accession to be passed through when computing the digests of parent objects.

This decision simplifies computation by reducing the number of hash operations needed. Since every variation and location in VRS 2.0 requires at least one sequence reference, the saving applies to every identifiable VRS object created. The cost is added complexity in the computed identifier method, since implementations must support a custom digest routine for this specific class (as pointed out by @theferrit32 during Fall 23 Connect). It also reduces the complexity of sequence retrieval operations, since the digest is compatible with RefGet implementations (including biocommons.seqrepo). Finally, this decision implies that no attributes of the SequenceReference class fundamentally change the meaning of a variant.
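The trade-off can be sketched in a few lines of Python. This is not the normative VRS serialization or digest algorithm: the canonical JSON form, the function names, the `start`/`end` fields, and the placeholder accession are all assumptions chosen only to contrast one hash per parent object (pass-through) with two (a separately digested SequenceReference, keyed on refgetAccession as in the earlier SQR. draft).

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def canonical(obj: dict) -> bytes:
    """Assumed canonical form for this sketch: sorted keys, no whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest_passthrough(location: dict) -> str:
    """One hash: the RefGet digest stands in for the SequenceReference."""
    accession = location["sequenceReference"]["refgetAccession"]
    refget_digest = accession.split(".", 1)[1]  # drop the "SQ." prefix
    return sha512t24u(canonical({**location, "sequenceReference": refget_digest}))


def digest_nested(location: dict) -> str:
    """Two hashes: digest the SequenceReference first, then the parent."""
    accession = location["sequenceReference"]["refgetAccession"]
    ref_digest = sha512t24u(canonical(
        {"refgetAccession": accession, "type": "SequenceReference"}))
    return sha512t24u(canonical({**location, "sequenceReference": ref_digest}))


ref = {"type": "SequenceReference", "refgetAccession": "SQ." + "0" * 32}  # placeholder
base = {"type": "SequenceLocation", "start": 10, "end": 11}
plain = {**base, "sequenceReference": ref}
annotated = {**base, "sequenceReference": {**ref, "residueAlphabet": "nucleic acid"}}

# residueAlphabet never enters the hash, so both locations get the same digest.
print(digest_passthrough(plain) == digest_passthrough(annotated))  # True
```

In the pass-through variant the only special handling is the step that swaps the SequenceReference for its RefGet digest, which corresponds to the custom digest routine mentioned above.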

As rightly pointed out by @andrewyatz during Connect (and in my comment at the beginning of this issue), this implication is technically untrue for certain classes of variation. Specifically, when a sequence can represent either a valid amino acid string or a valid nucleic acid string, the meaning of the sequence (and of variants on the sequence) is quite different. We did not make this distinction in VRS 1.x, and the current implementation continues this, though a new optional residueAlphabet field in the SequenceReference class provides a mechanism for implementations to explicitly state the alphabet used by the sequence. I am okay with leaving this digest ambiguity, as this technical possibility has not been observed in practice and is (in my opinion) outweighed by the benefits of simplifying sequence retrieval with other standards. If we were to make new digests that include residueAlphabet values, the field would become required and implementations would have to annotate sequences with alphabet information in order to generate VRS objects, increasing complexity and potentially reducing adoption.

Another point raised at Connect was the idea that we now have multiple objects of different types (RefGet sequences and VRS SequenceReference objects) that share the same ID. To be clear, that is not the case; the id field of a SequenceReference object is not required to be a GA4GH computed identifier, nor is the SequenceReference class GA4GH identifiable (no longer having a SQR. or other prefix).

I invite discussion about concerns with this approach on this thread.

andrewyatz (Maintainer) · 2023-10-12T11:14:20Z

I agree residueAlphabet is not an intrinsic part of the sequence, and the advantages laid out here are important. I'm not sure whether the designation of sequence type should sit at the SequenceReference level or higher. residueAlphabet is good for disambiguating between DNA and protein, but apart from protein it doesn't help disambiguate between genomic and coding sequences (genomic, coding, and protein being the 3 levels I am assuming variation should be recorded against). I would appreciate input on whether it is seen as useful to know if we are talking about a genomic variant or one against a coding sequence.

rrfreimuth · 2023-10-18T04:46:12Z

I think I understand the use cases and rationale underlying the proposal at Connect, but to be honest it feels like the model conflates a couple of concepts. This simplification may be acceptable. My primary concern is that the model supports the minimum level of semantic expressiveness that is required to meet all current (and hopefully near-future) use cases. If it does that, then it will also provide a degree of stability over time that will allow implementers to build off of it. Because these are such fundamental concepts at the core of so many use cases, changes at this level can have ripple effects through anything that uses it. Therefore, I think we should take a little time to ensure we get it as right as we can.

The discussion regarding inherent properties of sequence made clear (to me, anyway) that there are at least two concepts involved:

- the sequence itself, which refers to the primary, linear arrangements of subunits¹
- the representation of the sequence, which refers to how we instantiate the primary, linear arrangement

This distinction is reflected in this statement, which differentiates between the "meaning" and the "representation" of a sequence:

"Specifically, when a sequence can represent either a valid amino acid or a nucleic acid string, the meaning of the sequence (and variants on the sequence) is quite different."

- M vs. Met: Clearly different ASCII representations, but do they mean the same sequence?
- ATG vs. ATG: Nucleotide start codon or tripeptide? Do they mean the same sequence?

In another example: The codon changed in EGFR T790M (2369C>T)² is ACG to ATG, which can be expressed as AYG using IUPAC nucleotide ambiguity codes. All of those components (T, M, C, T, ACG, ATG, AYG) use valid nucleotide or protein alphabets. While most of those components clearly mean different things (based on string identity), one T represents threonine while another T represents thymine. Both are used in the context of describing a single variation, and both would have the same LiteralSequenceExpression³.
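The overlap between alphabets in this example can be checked mechanically. The sketch below is illustrative only: it assumes the 16 IUPAC nucleotide codes (including ambiguity codes) and the 20 standard one-letter amino acid codes, and `valid_alphabets` is a hypothetical helper, not part of VRS.

```python
from typing import List

# Assumed alphabets for this sketch: IUPAC nucleotide codes (with ambiguity
# codes) and the 20 standard one-letter amino acid codes.
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
AMINO_ACID_20 = set("ACDEFGHIKLMNPQRSTVWY")


def valid_alphabets(residues: str) -> List[str]:
    """Return every alphabet under which all characters are valid codes."""
    chars = set(residues.upper())
    alphabets = []
    if chars <= AMINO_ACID_20:
        alphabets.append("amino acid")
    if chars <= IUPAC_NUCLEOTIDE:
        alphabets.append("nucleic acid")
    return alphabets


# Components from the EGFR T790M example: each is valid under both alphabets,
# so the characters alone cannot say which one was meant.
components = ["T", "M", "C", "ACG", "ATG", "AYG"]
for component in components:
    print(component, valid_alphabets(component))
```

Under these assumptions every component is reported as valid in both alphabets, which is why a character-level check cannot recover the intended meaning without an explicit annotation such as residueAlphabet.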

The fact that we have been wrestling with this issue suggests this distinction is important for at least some of our use cases, but because we are experimenting with both expanded and simplified models it might not be necessary in all implementations.

In another group⁴, our model supports the distinction between sequence "concept" and sequence representation (which includes a field for alphabet). We need this because representations are heterogeneous, yet identity-based operations must still determine whether two sequences are semantically the same. A single sequence could be expressed using multiple representations, which permit domain-friendly conventions and which can be normalized to the same literal sequence expression to permit identity operations.

I think we need to determine whether it is important for us to be able to assess not only when two instances have identical textual serializations (the digest), but also whether they mean the same thing. If we can come to a conclusion on that, then the questions about what to include in the model (and how) might become clearer.

¹ Note that the type of subunit may be a critical piece of information that defines the type of sequence, but this is not necessarily sufficient in all cases. Concepts such as the physical topology of the molecule (linear vs circular) and distinctions based on biological function (e.g., mRNA vs rRNA, or genomic DNA vs cDNA) represent additional concepts that are not captured by sequence or subunit type. Nonetheless, it may be convenient to model and represent these concepts as a single attribute unless use cases are presented that require additional levels of detail.
² ClinVar
³ Please correct me if my understanding is out of date.
⁴ Sorry...

github-actions[bot] (bot) · 2024-01-09T02:08:19Z

This issue was marked stale due to inactivity.
