Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update digest serialization rules in docs #410

Closed
korikuzma opened this issue Nov 1, 2022 · 14 comments
Closed

Update digest serialization rules in docs #410

korikuzma opened this issue Nov 1, 2022 · 14 comments
Assignees

Comments

@korikuzma
Copy link
Contributor

The digest serialization docs do not explicitly say how to handle arrays of objects that can't be serialized (i.e., Genotype.members) but still have the property ordered=False. Below is a proposed example from @ahwagner where we serialize each GenotypeMember and sort Genotype.members based on the serialized strings.

Blank diagram(2)

@larrybabb
Copy link
Contributor

@ahwagner sorry to need reminding (again), but why is the ordering important to the genotypeMembers array?

@ahwagner
Copy link
Member

ahwagner commented Nov 2, 2022

@larrybabb the members array has property ordered: false (here), indicating that the order of elements in the Genotype.members array is not meaningful. The consequence of this is that when creating a computed digest for the Genotype, we need to have a consistent approach to ordering these elements so we get the same digest regardless of the element order.

To date, we have only had identifiable objects in arrays, and for those we would compute the digests of the array objects and then sort the array lexicographically. In this new case, Genotype.members elements are GenotypeMember JSON objects that are not identifiable and so we need a strategy for ordering them. My proposal, captured by @korikuzma above, is that we extend our existing serialization strategy to first serialize non-identifiable objects in arrays at the same time we compute digests for identifiable objects in array. Then all array objects are sorted lexicographically. I think the PR associated with this issue should also include some documentation updates for the digest serialization strategy to make this clear.

@larrybabb
Copy link
Contributor

@ahwagner got it. thank you for re-explaining that.
Question: what if we dropped the ordered property altogether and presumed that all arrays are unordered (I presume that will be the rule vs the exception). If an attribute that is an array comes along that needs to be ordered then we can add that special attribute to it to describe how it should be sorted or ordered (assuming there might be more than one way?). If there is only going to be one way to "order" arrays that need to be ordered then we can possibly use the idea of naming the attributed with the prefix orderedXXX: thus avoiding the need for this additional boolean property.

It's just that an ordered: t/f attribute in a json message doesn't tell me much in terms of being able to validate the information on the receiving end. If we have to presume the order is correct and must be preserved then qualifying the attribute with ordered should/would provide just as much information as a boolean flag. We've avoided flags like this to date. It seems like we may want to continue avoiding them.

if the genotype.members was an ordered array then we would call it genotype.orderedMembers and presume that whatever order the members are placed in the array must be preserved. Another possibility is to define a class called OrderedArray that can be used for any attributes that need to have their order preserved.

Thoughts?

@reece
Copy link
Member

reece commented Nov 3, 2022

Do I understand correctly that the proposal is to add a non-standard "ordered" attribute to the members property of the message?

@larrybabb
Copy link
Contributor

I think so. My questions above are for @ahwagner and offer some other options possibly. Let's see how he responds.

@ahwagner
Copy link
Member

ahwagner commented Nov 3, 2022

@reece and @larrybabb there are two concerns here. Tagging @andreasprlic because this is an important technical implementation discussion he should also weigh in on.

Concern 1. we need to sort some JSON arrays and not sort others for digest serialization

JSON Schema does not differentiate between arrays and sets, but VRS does. We represent all sets in VRS (e.g. VariationSet.members, Genotype.members) as arrays in JSON Schema, and this represents the typical case for an array in VRS. However, when we introduced ComposedSequenceExpressions (CSEs), we needed a way of differentiating between ordered arrays for CSEs and unordered arrays for everything else. At the time, our digest serialization rules were sufficient to resolve this, as we only reordered arrays if they contained identifiers (as pointed out by @reece, here). Since CSEs were not identifiable, they did not get changed into identifiers, and no sorting took place.

Later, however, we added Genotype, which contained an unordered members array with non-identifiable GenotypeMember objects. Because these are non-identifiable our current digest serialization rules treat this array the same way it treats a CSE array: as ordered (in this case, incorrectly). This issue (#410) and the associated PR (#409) were created to prompt discussion about updating the serialization rules to handle this situation. In addition to the need to sort this array, we need to define a mechanism that allows us to sort JSON Objects (which are evaluated as dicts in Python and have no default sorting comparison behavior). This proposal above defines a sorting behavior.

As an aside, one potential decision that would sidestep this issue (for now) is to make GenotypeMember objects identifiable, but I think this is not the correct design choice as GenotypeMembers are intended to be used only as a nested class within Genotype. Making them globally identifiable is contrary to that intended use. If there is disagreement on that underlying decision the conversation should start there.

Concern 2. define a mechanism that allows us to uniformly indicate sort behavior across classes

During the GKS-Pilot work we opted to use the ordered attribute in the JSON Schema (as opposed to within-message solutions) to handle the challenge of Concern 1. As the JSON Schema is parsed by VRS-Python and the JSON Schema specification allows for custom attributes, this seemed like a good solution to explicitly define ordered/unordered array behavior on a per-array level without increasing message size or changing digests of VRS objects. We discussed the advantages of a schema-level attribute on this thread. @larrybabb I think we saw eye-to-eye on this at the time, but @reece and @andreasprlic did not weigh in there, so it seems appropriate to me to revisit this decision and get their comments and come to a consensus decision.

There are many approaches we may take to address this concern, including some previously suggested ones:

Schema-based approaches (not in message)

  • ordered boolean property for arrays in JSON Schema (currently implemented solution)

Message-based approaches (defined in schema and explicit in message)

Documentation approaches (implementation concern only; not in message or schema)

  • Revise serialization rules and/or class-specific implementation guidance to describe the expected sort behavior of arrays

I ask that we keep the discussion on Concern 1: array sorting proposal in this thread and discuss Concern 2: indicating sort behavior (which is dependent on resolving the first concern) in a separate issue (#411).

@larrybabb
Copy link
Contributor

@ahwagner I get what you are saying. Thank you (again) for taking the time and effort to lay out the details above.

It seems like the 2 issues are

  1. When an array contains items with no specified ordering and are also un-identifiable (no computed digest) we need a mechanism to make sure we can digest the parent class of the array element consistently. This is the Genotype.member[] problem we are currently trying to address.

  2. When an array has a need to be in a specific order whether the elements are identifiable or not, we need a mechanism to assure the ordering is preserved. This one is less clear to me here because in the CSEs example you cite above there is no value in the json that would allow one to understand what the intended order is. I'm guessing we simply presume it is in the correct order with no way of verifying it? This seems a bit risky, and I would need some more clarity on why this is acceptable to us. If we are saying that we are not responsible for the ordering and presume however we create the data is the order it is in and the data does not need to contain any values (i.e. indices) to clarify the accuracy of the order, then we really shouldn't order anything ever.

The CSE case should probably be set up as a linked list or nesting construct to make sure that the chain/hierarchy is semantically included in the data. Maybe all ordered lists should use a designed construct to preserve those semantics?

For any data that has no ordering we would only need to solve the issue of digesting un-identifiable components in lists. For this we should assume that any items in a Value object array would meet the requirement of being a value object too and we should simply digest these elements and sort them by their value object digest, even though we don't ever persist or preserve these non-identifiable element's ids.

Again, I apologize for re-surfacing these issues. I think the idea of having an ordered flag attribute in any/all array parent classes is a bit difficult to accept as a final solution to this issue.

@ahwagner
Copy link
Member

ahwagner commented Nov 6, 2022

Separating out these distinct concerns between this thread and #411.

Addressing here:

  1. When an array contains items with no specified ordering and are also un-identifiable (no computed digest) we need a mechanism to make sure we can digest the parent class of the array element consistently. This is the Genotype.member[] problem we are currently trying to address.
    ...
    For any data that has no ordering we would only need to solve the issue of digesting un-identifiable components in lists. For this we should assume that any items in a Value object array would meet the requirement of being a value object too and we should simply digest these elements and sort them by their value object digest, even though we don't ever persist or preserve these non-identifiable element's ids.

I think this approach is very similar to the proposal laid out above. The array content needs to be digest serialized, then digested. I had proposed we sort on the digest serialized outputs to save on the extra compute of creating digests (since these are not persisted anyways). Is there a reason we want to take the extra step to create digests for these objects, e.g. consistency with the approach for identifiable objects? I'm okay with that, but just want to acknowledge the extra compute expense associated with this decision.

@ahwagner
Copy link
Member

ahwagner commented Nov 7, 2022

On 11/7 leads call, @larrybabb @andreasprlic and @ahwagner agreed that sorting on digests (even for non-identifiable objects) is the more consistent approach.

Proposed Resolution: digest ALL JSON Objects in arrays for sorting during digest serialization, UNLESS the array order is meaningful (indicated as described in resolution of #411).

@korikuzma
Copy link
Contributor Author

Blank diagram(3)
Here is an updated visualization of the proposed solution. Since ordered=False for Genotype.members, the final serialization will always be the same regardless of Genotype.members order

@ahwagner
Copy link
Member

ahwagner commented Nov 9, 2022

Implemented in #409

@ahwagner ahwagner closed this as completed Nov 9, 2022
@korikuzma
Copy link
Contributor Author

@ahwagner I did not update the docs with this change. Did you want me to do this in a separate PR?

@ahwagner
Copy link
Member

ahwagner commented Nov 9, 2022

Yes, good catch. Reopening this issue until the documentation is updated.

@ahwagner ahwagner reopened this Nov 9, 2022
Copy link

github-actions bot commented Jan 9, 2024

This issue was marked stale due to inactivity.

@github-actions github-actions bot added the Stale See .github/workflows/stale.yml label Jan 9, 2024
@ahwagner ahwagner removed the Stale See .github/workflows/stale.yml label Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants