-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
different runs return different number of unitigs and distinct color subsets #24
Comments
I'm CCing also @jnalanko who might be interested. Have you also notice this? I can confirm that different runs of themisto may report different number of color sets, e.g., 4,236,355 vs. 4,236,354, for the same issue. |
Hi, @jermp. No, I have not noticed this. Thankfully, the correctness of Themisto should not be affected either, as long as the unitigs cover all k-mers at least once, and the colors of the unitigs not wrong. |
Yes exactly, both Themisto and Fulgor are correct anyway since the num. of kmers always seem to be correct and consistent. Just letting you know. |
Hi @jermp, the difference in color subsets is due to a (partially wanted, for performance) race condition when updating the hash table of the colors. For efficiency and technical reasons, a new color is first written to disk and then added to an in-memory hashmap that is used to deduplicate equal colors, without holding a global lock. In practice, this should not lead to incorrect results, unless you check for color equality only by comparing the color subset index without reading the colormap, also the increase in space for storing (few) duplicate colors is very small, and the computed compacted graph has always the same set of maximal unitigs. The difference in unitigs counts you're seeing is also due to the problem above since when two adjacent kmers in the same maximal unitig have a different color index (even if it refers to the same color set) they are split into multiple unitigs. Also, this problem tends to be more visible with very small datasets, where there is a high chance that two kmers processed by different threads at the exact same time share the same color. If you need the color sets to be always distinct I can try to see if there is a way to ensure that, maybe putting this requirement under an optional feature flag if it hurts performance. |
Hi @Guilucand and thank you for confirming. |
I agree it's not a hard requirement, but it would be very nice to have consistency between runs and to know the true number of colors. Differences are small in our example, but maybe could grow larger for very similar pangenomes and tons of threads. How is color equality checked, currently? As map contention is very likely to be low anyway, perhaps you could try a sharded map like DashMap to avoid almost all lock contention? Also, thanks for the quick response! |
Hello @Guilucand and @alexandrutomescu.
Let me first congratulate with you for this excellent algorithm!
We are currently using GGCAT for building Fulgor, a colored k-mer index based on SSHash (same functionalities offered by Themisto).
I noticed that, when running the same command multiple times, I get small differences in the number of unitigs/color subsets.
For example, if I run:
I sometimes get 171 color subsets and 86630 unitigs, or 172 and 86632, or 173 and 86648.
I'm inclined to think this might be an issue with multithreading since I did not notice this if I use 1 thread only.
Or, is it just something that can happen because unitigs are not required to be maximal anyway?
(The correctness of Fulgor is not broken as long as the number of kmers stays the same, as it seems to be.
But I just wanted you to know this potential issue. Ideally, I'd expect to always get the same number across different runs.)
Thank you very much!
CC: @rob-p.
The text was updated successfully, but these errors were encountered: