Replies: 4 comments
-
This issue was raised during plenary. Our current 2.0-alpha model does away with a computed identifier for this class (no

This decision allowed us to simplify compute by reducing the number of hash operations needed; since every variation and location in VRS

As rightly pointed out by @andrewyatz during Connect (and in my comment at the beginning of this issue), this implication is technically untrue in certain classes of variation. Specifically, when a sequence can represent either a valid amino acid or a nucleic acid string, the meaning of the sequence (and variants on the sequence) is quite different. We did not make this distinction in VRS

Another point raised at Connect was the idea that we now have multiple objects of different types (RefGet sequences and VRS SequenceReference objects) that share the same ID. To be clear, that is not the case; the

I invite discussion about concerns with this approach on this thread.
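To make the ambiguity concrete, here is a minimal sketch of how a digest-based sequence identifier is blind to the alphabet. It assumes the GA4GH truncated digest (SHA-512, first 24 bytes, base64url) applied to the raw sequence bytes; the alphabet sets and the `possible_alphabets` helper are simplified, hypothetical illustrations and not part of VRS or RefGet.

```python
import base64
import hashlib

# Simplified residue sets for illustration only (assumption: no ambiguity
# codes beyond N, no stop or gap characters).
NUCLEIC_ACID = set("ACGTN")
AMINO_ACID = set("ACDEFGHIKLMNPQRSTVWY")


def sha512t24u(blob: bytes) -> str:
    """Truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def possible_alphabets(sequence: str) -> list[str]:
    """Hypothetical helper: list the alphabets under which the string is valid."""
    residues = set(sequence.upper())
    names = []
    if residues <= NUCLEIC_ACID:
        names.append("na")
    if residues <= AMINO_ACID:
        names.append("aa")
    return names


seq = "GATTACA"
# The identifier depends only on the characters, not on what they mean.
identifier = "SQ." + sha512t24u(seq.encode("ascii"))
print(identifier, possible_alphabets(seq))
```

Because G, A, T and C are also one-letter amino acid codes, the string passes both checks, yet it has exactly one digest. Any alphabet information therefore has to travel alongside the identifier rather than inside it.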
-
I agree
-
I think I understand the use cases and rationale underlying the proposal at Connect, but to be honest it feels like the model conflates a couple of concepts. This simplification may be acceptable.

My primary concern is that the model supports the minimum level of semantic expressiveness required to meet all current (and hopefully near-future) use cases. If it does that, then it will also provide a degree of stability over time that will allow implementers to build on it. Because these are such fundamental concepts at the core of so many use cases, changes at this level can have ripple effects through anything that uses it. Therefore, I think we should take a little time to ensure we get it as right as we can.

The discussion regarding inherent properties of sequence made clear (to me, anyway) that there are at least two concepts involved:
This distinction is reflected in this statement, which differentiates between the "meaning" and the "representation" of a sequence:
In another example: The codon changed in EGFR T790M (2369C>T)[2] is

The fact that we have been wrestling with this issue suggests this distinction is important for at least some of our use cases, but because we are experimenting with both expanded and simplified models it might not be necessary in all implementations.

In another group[4], our model supports the distinction between sequence "concept" and sequence representation (which includes a field for alphabet), because representations are heterogeneous yet we need to support identity-based operations that determine whether two sequences are semantically the same. A single sequence could be expressed using multiple representations, which permit domain-friendly conventions and which can be normalized to the same literal sequence expression to permit identity operations.

I think we need to determine whether it is important for us to be able to assess not only when two instances have identical textual serializations (the digest), but also whether they mean the same thing. If we can come to a conclusion on that, then the questions about what to include in the model (and how) might become clearer.
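As a rough illustration of normalizing several representations to one literal before an identity check, the sketch below compares a one-letter and a three-letter amino acid spelling of the same residues. The `SequenceRepresentation` class, the `aa1`/`aa3` alphabet labels, and both functions are hypothetical names chosen for this example; they are not taken from VRS or from the other group's model.

```python
from dataclasses import dataclass

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


@dataclass(frozen=True)
class SequenceRepresentation:
    """Hypothetical pairing of a textual form with the alphabet it uses."""
    alphabet: str  # "aa1" (one-letter) or "aa3" (three-letter); labels are assumptions
    text: str


def to_literal(rep: SequenceRepresentation) -> str:
    """Normalize a representation to a one-letter literal sequence."""
    if rep.alphabet == "aa1":
        return rep.text.upper()
    if rep.alphabet == "aa3":
        if len(rep.text) % 3 != 0:
            raise ValueError("three-letter text must be a multiple of 3 characters")
        codes = [rep.text[i:i + 3].capitalize() for i in range(0, len(rep.text), 3)]
        return "".join(THREE_TO_ONE[code] for code in codes)
    raise ValueError(f"unknown alphabet: {rep.alphabet}")


def same_sequence(a: SequenceRepresentation, b: SequenceRepresentation) -> bool:
    """Identity check on meaning: compare normalized literals, not raw text."""
    return to_literal(a) == to_literal(b)


one_letter = SequenceRepresentation("aa1", "TM")
three_letter = SequenceRepresentation("aa3", "ThrMet")
print(same_sequence(one_letter, three_letter))
```

The raw strings differ, so a digest of either text alone would not match the other; only the normalized literal supports the "do they mean the same thing" question raised above.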
-
This issue was marked stale due to inactivity.
-
References to sequences via GA4GH SQ.<digest> identifiers have been a feature of VRS since 1.0. However, these digests (based on the RefGet spec) do not contain information about the sequence alphabet or the molecule circularity (circular / linear), which makes it challenging to implement sequence operations across different molecule types. Some initial work is being planned in biocommons/biocommons.seqrepo#113 to implement and use these metadata for sequence operations. In VRS, we will need to consider how to capture these metadata as a separate structure, since this issue will not be addressed in the upcoming RefGet v2.

An initial implementation proposal is to create an annotation wrapper layer as a separate class that could then be referenced by parent objects such as SequenceLocation. We have drafted such a class as an example here: vrs/schema/vrs-source.yaml, lines 422 to 447 in c693ac3.

Note this requires creating a separate type prefix (SQR., for Sequence Reference), as this class references a molecular sequence but is distinct from the sequence as defined by RefGet (with the reserved SQ. namespace). A drawback of this approach is an additional layer of indirection that implementations will need to support to get back to RefGet accessions.

Opening this for discussion and progress on this topic for VRS 2.0.
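For discussion purposes, the wrapper idea might look roughly like the sketch below. It is not the drafted schema: the field names (`refget_accession`, `residue_alphabet`, `circular`) and the `fetch_region` helper are assumptions made for illustration, and the `SQ.<digest>` value is a placeholder. It also assumes inter-residue coordinates and that `start > end` denotes a region spanning the origin of a circular molecule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceReference:
    """Hypothetical wrapper that annotates a RefGet accession with metadata."""
    refget_accession: str                    # e.g. "SQ.<digest>" (placeholder)
    residue_alphabet: Optional[str] = None   # "na" or "aa"; None if unknown
    circular: Optional[bool] = None          # None if unknown


@dataclass(frozen=True)
class SequenceLocation:
    """Parent object that points at the wrapper rather than at SQ.<digest>."""
    sequence_reference: SequenceReference
    start: int
    end: int


def fetch_region(sequence: str, location: SequenceLocation) -> str:
    """Slice a sequence, wrapping around the origin only for circular molecules."""
    if location.start <= location.end:
        return sequence[location.start:location.end]
    if location.sequence_reference.circular:
        return sequence[location.start:] + sequence[:location.end]
    raise ValueError("start > end is only meaningful on a circular molecule")


plasmid = SequenceReference("SQ.<digest>", residue_alphabet="na", circular=True)
location = SequenceLocation(plasmid, start=4, end=2)

# The extra hop from a location back to the RefGet accession:
print(location.sequence_reference.refget_accession)
print(fetch_region("ACGTAC", location))
```

The one-attribute hop in the first `print` is the additional layer of indirection mentioned above; the second shows the kind of operation that cannot be implemented safely from the digest alone.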