Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison function does not maintain row-wise dependencies when reporting on order #36

Open
sveinugu opened this issue Sep 21, 2022 · 3 comments

Comments

@sveinugu
Copy link
Collaborator

sveinugu commented Sep 21, 2022

There is one problem with the current solution for the comparison function that I believe we have not properly considered. It might be that we are ok the current functionality, but think that it should be conscious decision, and we should report this as a known issue.

The issue is best explained with a simple contrived example. Given the following sequence collection A:

names lengths sequences
chr1 12345 96f04ea2c
chr2 23456 00330e995
chr3 34567 572853213

Let's compare this with sequence collection A', where we shuffle the rows, e.g.:

names lengths sequences
chr3 34567 572853213
chr1 12345 96f04ea2c
chr2 23456 00330e995

The comparison function would return the following:

{
  "digests": {
    "a": "b57173a40",
    "b": "1ab89fe61"
  },
  "arrays": {
    "a-only": [],
    "b-only": [],
    "a-and-b": [
      "lengths",
      "names",
      "sequences"
    ]
  },
  "elements": {
    "total": {
      "a": 3,
      "b": 3
    },
    "a-and-b": {
      "lengths": 3,
      "names": 3,
      "sequences": 3
    },
    "a-and-b-same-order": {
      "lengths": false,
      "names": false,
      "sequences": false
    }
  }
}

So let's say we instead shuffle the names and sequences array independently, but let the lengths array follow the sequences to keep the internal consistency, such as in the following sequence collection A'':

names lengths sequences
chr2 23456 00330e995
chr1 34567 572853213
chr3 12345 96f04ea2c

Then comparing any two of the three collections A, A', and A'' would give the same results from the comparison function (except the digests, of course). Such a result would typically be interpreted as "they have the same sequences, the only difference is their order". But the most definitely are not the same sequences, as the names refer to different sequences in A'' compared to the other two.

The reason behind this is simply that the comparison function considers each array individually, which is again due to the fact that we are structuring the sequence collections array-wise instead of item-wise (or column-wise instead of row-wise, if you want).

Granted, this is in practice an edge case which might never happen in the data itself. But it could very possibly appear due to some coding bug. To me, having this logical flaw reduces the trust one can have to the comparison function as a consumer.

I do have a suggestion that might solve this and other related issues. Sorry for not posting this earlier, but I have been swamped with work lately.

@nsheff
Copy link
Member

nsheff commented Feb 7, 2024

I believe this is now solved by the sorted_name_length_pairs, right?

Edit:

Sveinung noted in #40 that it's

Storing the array of digests is a partial solution for that issue, but only for the names and lenghts arrays.

I think this is going to be enough, maybe we just note this somewhere in the docs on the comparison function. To solve this universally for all possible arrays would mean an explosion of digests -- and I'm not sure we really need the other ones, at least, not universally

In other word: I am satisfied with this limitation of the compare result. It is not intended to be comprehensive; it's intended to give you an initial look to decide if you need to look further.

@sveinugu
Copy link
Collaborator Author

sveinugu commented Feb 7, 2024

Following in the direction of sorted_name_length_pairs, one could consider a similar (non-inherent) array type containing the values of all inherent arrays for each element, JSON canonicalized.

One other type of solution would be to add some extra functionality to the comparison endpoint to say something about the consistency of element values across arrays, in similar vein as in-same-order. This might be costly to compute.

@tcezard
Copy link
Collaborator

tcezard commented Feb 7, 2024

I think I agree with @nsheff here.
The only thing I would add is that the prevalence of this issue will depend on the data type of you're storing.
If you storing genomes the probability of having 2 genomes with the same sequences called with the same set of names but associated with different sequence is on the edge of impossibility.

I guess it might be more likely for other data source transcriptome or metagenome but in that case we can recommend the use of additional arrays that link specific attributes like the sorted_name_length_pairs array

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants