Comparison function does not maintain row-wise dependencies when reporting on order #36
Comments
I believe this is now solved by the sorted_name_length_pairs, right? Edit: Sveinung noted in #40 that it's
I think this is going to be enough; maybe we just note this somewhere in the docs on the comparison function. To solve this universally for all possible arrays would mean an explosion of digests -- and I'm not sure we really need the other ones, at least not universally. In other words: I am satisfied with this limitation of the compare result. It is not intended to be comprehensive; it's intended to give you an initial look to decide if you need to look further.
Following in the direction of One other type of solution would be to add some extra functionality to the comparison endpoint to say something about the consistency of element values across arrays, in similar vein as
I think I agree with @nsheff here. I guess it might be more likely for other data sources, such as transcriptomes or metagenomes, but in that case we can recommend the use of additional arrays that link specific attributes like the
There is one problem with the current solution for the comparison function that I believe we have not properly considered. It might be that we are OK with the current functionality, but I think that it should be a conscious decision, and we should report this as a known issue.
The issue is best explained with a simple contrived example. Given the following sequence collection A:
Let's compare this with sequence collection A', where we shuffle the rows, e.g.:
The comparison function would return the following:
So let's say we instead shuffle the names and sequences arrays independently, but let the lengths array follow the sequences to keep the internal consistency, such as in the following sequence collection A'':
Then comparing any two of the three collections A, A', and A'' would give the same results from the comparison function (except the digests, of course). Such a result would typically be interpreted as "they have the same sequences, the only difference is their order". But they most definitely are not the same sequences, as the names refer to different sequences in A'' compared to the other two.
The reason behind this is simply that the comparison function considers each array individually, which is again due to the fact that we are structuring the sequence collections array-wise instead of item-wise (or column-wise instead of row-wise, if you want).
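The effect can be reproduced with a small sketch. The example tables are not preserved in this copy of the issue, so the values below are made up for illustration, and compare_arrays is a simplified, hypothetical stand-in for an array-wise comparison, not the actual comparison function. The rows helper is likewise an assumption, added only to show what an item-wise view would reveal.

```python
from collections import Counter


def compare_arrays(a, b):
    """Simplified stand-in for an array-wise comparison (hypothetical).

    Each attribute array is compared on its own, with no link to the
    other arrays of the same collection.
    """
    result = {}
    for attr in sorted(set(a) & set(b)):
        # Multiset intersection: how many elements the two arrays share.
        overlap = sum((Counter(a[attr]) & Counter(b[attr])).values())
        result[attr] = {
            "a": len(a[attr]),
            "b": len(b[attr]),
            "a_and_b": overlap,
            "same_order": a[attr] == b[attr],
        }
    return result


def rows(collection):
    """Item-wise view: the set of (name, length, sequence) rows."""
    return set(
        zip(collection["names"], collection["lengths"], collection["sequences"])
    )


# Made-up collection A.
A = {
    "names": ["chr1", "chr2", "chr3"],
    "lengths": [4, 6, 8],
    "sequences": ["ACGT", "ACGTAC", "ACGTACGT"],
}
# A': whole rows shuffled, so every name keeps its sequence.
A_prime = {
    "names": ["chr2", "chr3", "chr1"],
    "lengths": [6, 8, 4],
    "sequences": ["ACGTAC", "ACGTACGT", "ACGT"],
}
# A'': names and sequences shuffled independently; lengths follow sequences.
A_double_prime = {
    "names": ["chr2", "chr1", "chr3"],
    "lengths": [4, 8, 6],
    "sequences": ["ACGT", "ACGTACGT", "ACGTAC"],
}

# All three pairwise array-wise comparisons are indistinguishable...
print(compare_arrays(A, A_prime) == compare_arrays(A, A_double_prime))
print(compare_arrays(A, A_prime) == compare_arrays(A_prime, A_double_prime))
# ...but only A and A' contain the same rows.
print(rows(A) == rows(A_prime), rows(A) == rows(A_double_prime))
```

In this sketch the array-wise result reports full overlap with a different order for every pair, while the item-wise view shows that A'' pairs the names with different sequences.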
Granted, this is in practice an edge case which might never happen in the data itself. But it could very possibly appear due to some coding bug. To me, having this logical flaw reduces the trust one can have in the comparison function as a consumer.
I do have a suggestion that might solve this and other related issues. Sorry for not posting this earlier, but I have been swamped with work lately.