Comparison function does not maintain row-wise dependencies when reporting on order #36

sveinugu · 2022-09-21T13:48:06Z

There is one problem with the current solution for the comparison function that I believe we have not properly considered. It might be that we are OK with the current functionality, but I think that should be a conscious decision, and we should report this as a known issue.

The issue is best explained with a simple contrived example. Given the following sequence collection A:

names	lengths	sequences
chr1	12345	96f04ea2c
chr2	23456	00330e995
chr3	34567	572853213

Let's compare this with sequence collection A', where we shuffle the rows, e.g.:

names	lengths	sequences
chr3	34567	572853213
chr1	12345	96f04ea2c
chr2	23456	00330e995

The comparison function would return the following:

{
  "digests": {
    "a": "b57173a40",
    "b": "1ab89fe61"
  },
  "arrays": {
    "a-only": [],
    "b-only": [],
    "a-and-b": [
      "lengths",
      "names",
      "sequences"
    ]
  },
  "elements": {
    "total": {
      "a": 3,
      "b": 3
    },
    "a-and-b": {
      "lengths": 3,
      "names": 3,
      "sequences": 3
    },
    "a-and-b-same-order": {
      "lengths": false,
      "names": false,
      "sequences": false
    }
  }
}

So let's say we instead shuffle the names and sequences arrays independently, but let the lengths array follow the sequences array to keep the internal consistency, as in the following sequence collection A'':

names	lengths	sequences
chr2	23456	00330e995
chr1	34567	572853213
chr3	12345	96f04ea2c

Then comparing any two of the three collections A, A', and A'' would give the same result from the comparison function (except for the digests, of course). Such a result would typically be interpreted as "they have the same sequences, the only difference is their order". But they most definitely are not the same sequences, as the names refer to different sequences in A'' than in the other two.

The reason behind this is simply that the comparison function considers each array individually, which is again due to the fact that we are structuring the sequence collections array-wise instead of item-wise (or column-wise instead of row-wise, if you want).
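A minimal Python sketch of the effect described above. It assumes a simplified array-wise comparison (the helper `compare_arraywise` is hypothetical and omits the digests and the finer rules of the real comparison function), and it only shows that the three collections from the example are indistinguishable when each array is inspected on its own.

```python
from collections import Counter


def compare_arraywise(a, b):
    """Compare two collections one array at a time (hypothetical, simplified)."""
    shared = sorted(set(a) & set(b))
    result = {"a-and-b": {}, "a-and-b-same-order": {}}
    for key in shared:
        # Multiset intersection: number of elements the two arrays share.
        overlap = Counter(a[key]) & Counter(b[key])
        result["a-and-b"][key] = sum(overlap.values())
        # Simplification: "same order" is plain list equality here.
        result["a-and-b-same-order"][key] = a[key] == b[key]
    return result


A = {
    "names": ["chr1", "chr2", "chr3"],
    "lengths": [12345, 23456, 34567],
    "sequences": ["96f04ea2c", "00330e995", "572853213"],
}
A_prime = {  # rows of A shuffled together
    "names": ["chr3", "chr1", "chr2"],
    "lengths": [34567, 12345, 23456],
    "sequences": ["572853213", "96f04ea2c", "00330e995"],
}
A_double_prime = {  # names shuffled independently of lengths/sequences
    "names": ["chr2", "chr1", "chr3"],
    "lengths": [23456, 34567, 12345],
    "sequences": ["00330e995", "572853213", "96f04ea2c"],
}

same = (
    compare_arraywise(A, A_prime)
    == compare_arraywise(A, A_double_prime)
    == compare_arraywise(A_prime, A_double_prime)
)
print(same)  # True: all three pairwise results are identical
```

Because no step ever looks at two arrays at once, the broken name-to-sequence pairing in A'' cannot show up in the output.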

Granted, this is in practice an edge case which might never happen in the data itself. But it could very possibly appear due to some coding bug. To me, having this logical flaw reduces the trust a consumer can have in the comparison function.

I do have a suggestion that might solve this and other related issues. Sorry for not posting this earlier, but I have been swamped with work lately.


nsheff · 2024-02-07T14:36:36Z

I believe this is now solved by the sorted_name_length_pairs, right?

Edit:

Sveinung noted in #40 that it's only a partial solution:

"Storing the array of digests is a partial solution for that issue, but only for the names and lengths arrays."

I think this is going to be enough; maybe we just note this somewhere in the docs on the comparison function. Solving this universally for all possible arrays would mean an explosion of digests, and I'm not sure we really need the other ones, at least not universally.

In other words: I am satisfied with this limitation of the compare result. It is not intended to be comprehensive; it's intended to give you an initial look to decide if you need to look further.
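To make the "partial solution" concrete, here is a rough Python sketch of a sorted_name_length_pairs-style array. It assumes that `json.dumps` with sorted keys is an acceptable stand-in for proper JSON canonicalization and that digests are truncated SHA-512 values; the helper names and the `same_len` example data are hypothetical.

```python
import base64
import hashlib
import json


def digest(obj):
    # Assumption: sorted keys and compact separators stand in for RFC 8785
    # canonicalization, which is adequate for this flat example data.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()
    # Truncate to 24 bytes and base64url-encode.
    return base64.urlsafe_b64encode(raw[:24]).decode("ascii")


def sorted_name_length_pairs(collection):
    # One digest per (name, length) pair, sorted so that row order is ignored.
    return sorted(
        digest({"length": length, "name": name})
        for name, length in zip(collection["names"], collection["lengths"])
    )


A = {
    "names": ["chr1", "chr2", "chr3"],
    "lengths": [12345, 23456, 34567],
    "sequences": ["96f04ea2c", "00330e995", "572853213"],
}
A_prime = {
    "names": ["chr3", "chr1", "chr2"],
    "lengths": [34567, 12345, 23456],
    "sequences": ["572853213", "96f04ea2c", "00330e995"],
}
A_double_prime = {
    "names": ["chr2", "chr1", "chr3"],
    "lengths": [23456, 34567, 12345],
    "sequences": ["00330e995", "572853213", "96f04ea2c"],
}

print(sorted_name_length_pairs(A) == sorted_name_length_pairs(A_prime))         # True
print(sorted_name_length_pairs(A) == sorted_name_length_pairs(A_double_prime))  # False

# Hypothetical data: equal lengths, sequences swapped between the two names.
same_len = {"names": ["x", "y"], "lengths": [100, 100], "sequences": ["s1", "s2"]}
same_len_swapped = {"names": ["x", "y"], "lengths": [100, 100], "sequences": ["s2", "s1"]}
print(sorted_name_length_pairs(same_len) == sorted_name_length_pairs(same_len_swapped))  # True
```

The array catches A'' because the lengths moved with the sequences there, but the last case slips through: names are re-paired with different sequences of the same length, and the name-length pairs never see it.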

sveinugu · 2024-02-07T14:58:54Z

Following in the direction of sorted_name_length_pairs, one could consider a similar (non-inherent) array type containing the values of all inherent arrays for each element, JSON canonicalized.
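A rough sketch of what such an array could look like, in Python. Several things are assumptions: the function name `sorted_row_digests` is hypothetical, the list of inherent arrays is passed in explicitly (here all three arrays are treated as inherent purely for illustration), and `json.dumps` with sorted keys stands in for real JSON canonicalization.

```python
import base64
import hashlib
import json


def digest(obj):
    # Assumption: sorted keys and compact separators approximate RFC 8785.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    raw = hashlib.sha512(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(raw[:24]).decode("ascii")


def sorted_row_digests(collection, inherent):
    # Build one object per element from the listed arrays, digest it, and
    # sort the digests so that the result does not depend on row order.
    rows = zip(*(collection[key] for key in inherent))
    return sorted(digest(dict(zip(inherent, row))) for row in rows)


INHERENT = ["names", "lengths", "sequences"]  # assumed for this example

A = {
    "names": ["chr1", "chr2", "chr3"],
    "lengths": [12345, 23456, 34567],
    "sequences": ["96f04ea2c", "00330e995", "572853213"],
}
A_prime = {
    "names": ["chr3", "chr1", "chr2"],
    "lengths": [34567, 12345, 23456],
    "sequences": ["572853213", "96f04ea2c", "00330e995"],
}
A_double_prime = {
    "names": ["chr2", "chr1", "chr3"],
    "lengths": [23456, 34567, 12345],
    "sequences": ["00330e995", "572853213", "96f04ea2c"],
}

print(sorted_row_digests(A, INHERENT) == sorted_row_digests(A_prime, INHERENT))         # True
print(sorted_row_digests(A, INHERENT) == sorted_row_digests(A_double_prime, INHERENT))  # False
```

Since every digest covers a whole row, a pure row shuffle (A') leaves the array unchanged while any re-pairing of values across arrays (A'') changes it.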

One other type of solution would be to add some extra functionality to the comparison endpoint to say something about the consistency of element values across arrays, in a similar vein to in-same-order. This might be costly to compute.
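One way such a consistency report could look, as a sketch only: count how many complete rows the two collections share across their common arrays. The function `row_consistency` and the output keys `rows-total` and `rows-a-and-b` are hypothetical and not part of the current comparison output, and the sketch assumes both collections are fully available in memory.

```python
from collections import Counter


def row_consistency(a, b):
    """Count rows (tuples across shared arrays) present in both collections."""
    keys = sorted(set(a) & set(b))
    rows_a = Counter(zip(*(a[key] for key in keys)))
    rows_b = Counter(zip(*(b[key] for key in keys)))
    # Multiset intersection keeps duplicated rows balanced.
    shared = sum((rows_a & rows_b).values())
    return {
        "rows-total": {"a": sum(rows_a.values()), "b": sum(rows_b.values())},
        "rows-a-and-b": shared,
    }


A = {
    "names": ["chr1", "chr2", "chr3"],
    "lengths": [12345, 23456, 34567],
    "sequences": ["96f04ea2c", "00330e995", "572853213"],
}
A_prime = {
    "names": ["chr3", "chr1", "chr2"],
    "lengths": [34567, 12345, 23456],
    "sequences": ["572853213", "96f04ea2c", "00330e995"],
}
A_double_prime = {
    "names": ["chr2", "chr1", "chr3"],
    "lengths": [23456, 34567, 12345],
    "sequences": ["00330e995", "572853213", "96f04ea2c"],
}

print(row_consistency(A, A_prime)["rows-a-and-b"])         # 3
print(row_consistency(A, A_double_prime)["rows-a-and-b"])  # 1
```

A row count lower than the per-array element counts, as in the second call, would signal that values are shared but paired differently.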

tcezard · 2024-02-07T15:01:28Z

I think I agree with @nsheff here.
The only thing I would add is that the prevalence of this issue will depend on the type of data you're storing.
If you are storing genomes, the probability of having two genomes with the same sequences and the same set of names, but with the names associated with different sequences, is on the edge of impossibility.

I guess it might be more likely for other data sources, such as transcriptomes or metagenomes, but in that case we can recommend the use of additional arrays that link specific attributes, like the sorted_name_length_pairs array.

sveinugu mentioned this issue Jun 15, 2023

Discussion on undigested attributes and sorted-name-length-pairs #40

Open

nsheff added a commit that referenced this issue Feb 7, 2024

add some detail on limitation of comparison. See #36

dfc2fcf

nsheff mentioned this issue Feb 7, 2024

Compare limitation #64

Merged
