control of 2 separate randseed events in sintax #535
Comments
The The choice in place 2 is currently not random. When the query matches two or more sequences in the database equally well regarding the number of shared kmers, the top match is the shortest of the sequences. If two or more are equally long, the sequence that comes first in the database is chosen. If the Do you think adding optional randomness to the choice in place 2 would be valuable? |
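For readers following along, the deterministic rule described in the comment above can be sketched in a few lines of Python. This is only an illustration of the ordering (most shared k-mers, then shortest, then earliest in the database); the `Hit` type and `pick_top_hit` function are hypothetical names and are not taken from the vsearch source.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    index: int         # position of the sequence in the database file
    length: int        # sequence length in bases
    shared_kmers: int  # number of k-mers shared with the query


def pick_top_hit(hits: list[Hit]) -> Hit:
    """Most shared k-mers wins; ties go to the shortest, then the earliest."""
    return min(hits, key=lambda h: (-h.shared_kmers, h.length, h.index))


# Three database sequences tie on shared k-mers; the 300-base one that
# comes first in the file is chosen, regardless of the longer candidate.
hits = [Hit(0, 1500, 32), Hit(1, 300, 32), Hit(2, 300, 32), Hit(3, 250, 30)]
top = pick_top_hit(hits)
```

The example makes the bias visible: the 1500-base sequence matches equally well but can never be reported while a shorter tied sequence exists.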
Yes, I think that adding randomness to the choice in place 2 would be valuable. In some cases a short 300-base 16S query is very similar, or even exactly identical, to many long 1500-base subjects in a 16S database, and these subjects have different taxonomic positions. I'd rather not have the vsearch user get the perception that the query matches one of these taxa more than the others, which might happen due to the length or order rule. |
Thanks for your suggestion. I will try to add an option to randomize the choice in place 2. |
I also expect that the random selection among the equally top hits might require thousands of sequences in the udb file to be considered in the random draw. Thus, I'd expect that more computation time would be required to collect all the hits (as opposed to only the first hit), which is a fair trade-off. Thanks again for trying to add this randomness option. |
Any more ideas on implementing this? I was considering a work-around: create 10 shuffles of the sequence ordering in the fasta database, format a udb from each, run my queries against each, and summarize the results in a post-processing step. But that would require over 15 GB of disk space for all those reference files. It might be too inefficient and would require lots of explanation in the methods section of the journal article. |
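The shuffling step of that work-around could look roughly like the sketch below. It is a minimal illustration in plain Python, assuming the FASTA text fits in memory; `read_fasta` and `shuffled_copies` are hypothetical helper names, and building a udb from each copy and summarizing the results are left out.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def shuffled_copies(records, n_copies, seed):
    """Return n_copies independently shuffled orderings, reproducible by seed."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        copy = list(records)
        rng.shuffle(copy)
        copies.append(copy)
    return copies


fasta = ">a\nACGT\n>b\nGGCC\nTT\n>c\nAAAA\n"
records = read_fasta(fasta)
copies = shuffled_copies(records, n_copies=10, seed=1)
```

A fixed seed keeps the ten orderings reproducible, but each ordering still needs its own reference file, which is where the disk cost mentioned above comes from.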
Yes, I've had a look at this right now and have already implemented it. A new option called I think this could be a significant improvement to the SINTAX algorithm as it will select a random sequence among a wider set of sequences instead of usually picking one of the shortest ones. |
Wonderful. Thank you. I think a 30% increase in runtime is less than I expected. That is good news. Looking forward to the update. |
Hi, I've now released vsearch version 2.28.1. It implements the feature you suggested. I made other improvements too, and I think the speed should be even better now. Again, thanks a lot for the suggested improvements!
The sintax command has been improved in several ways in this version of vsearch. Please note that several details of this algorithm are not clearly described in the preprint, and the implementation in vsearch differs from that in usearch.
The former vsearch version did not always choose the most common taxonomic entity over the 100 bootstraps among the database sequences with the highest amount of word similarity to the query. Instead, if several sequences had an equal similarity with the query, the sequence encountered in the earliest bootstrap was chosen. The confidence level was calculated based on this sequence compared to the selected sequences from the other 99 bootstraps. This could lead to a suboptimal choice with a low confidence. In the new version, the most common of the sequences with the highest amount of word similarity across the 100 bootstraps will be selected, and ties will be broken randomly.
Another problem with the old implementation was that if several sequences had the same amount of word similarity, the shortest one in the reference database would be chosen, and if they were equally long, the earliest in the database file would be chosen. A new option called sintax_random has now been introduced. This option will randomly select one of the sequences with the highest number of shared words with the query, without considering their length or position. This avoids a bias towards shorter reference sequences. This option is strongly recommended and will probably soon be the default.
Furthermore, a ninth taxonomic rank, strain (letter t), is now recognized. The speed of the sintax command has also been significantly improved, at least in some cases. Run vsearch with the randseed option and 1 thread to ensure reproducibility of the random choices in the algorithm.
These changes are relevant for issues #210, #325, #498, and #535. |
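The two random choices described in this release note can be illustrated with a small Python sketch. It is not the vsearch implementation: the function names are hypothetical, and for brevity every bootstrap reuses the same score table, whereas a real bootstrap would recompute scores from a subsample of the query words.

```python
import random
from collections import Counter


def pick_random_top(scores, rng):
    """Pick uniformly among the sequences with the highest shared-word count.

    Length and database position play no role (the sintax_random idea).
    """
    best = max(scores.values())
    tied = sorted(s for s, v in scores.items() if v == best)
    return rng.choice(tied)


def most_common_with_random_ties(picks, rng):
    """Return the most frequent pick across bootstraps; break ties randomly."""
    counts = Counter(picks)
    top = max(counts.values())
    tied = sorted(s for s, c in counts.items() if c == top)
    return rng.choice(tied)


# One seeded generator drives both random choices, so a fixed seed
# reproduces the whole run.
rng = random.Random(42)
scores = {"seqA": 30, "seqB": 30, "seqC": 12}  # simplification: same table each round
picks = [pick_random_top(scores, rng) for _ in range(100)]
winner = most_common_with_random_ties(picks, rng)
```

Sorting the tied candidates before drawing keeps the result independent of dictionary order, so the seed alone determines the outcome in this sketch.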
Thanks for these improvements. I’m working on a manuscript that will
benefit from your software engineering allowing me to leverage the
StrainSelect database all the way to the strain (t) level and produce
confidence values less-affected by db artifacts. Very exciting!
Would you consider becoming a co-author on this work?
Todd
On Fri, Apr 26, 2024 at 6:32 AM Torbjørn Rognes wrote:
Hi, I've now released vsearch version 2.28.1. It implements the feature
you suggested. I made other improvements too, and I think the speed should
be even better now. Again, thanks a lot for the suggested improvements!
The sintax command has been improved in several ways in this version of
vsearch. Please note that several details of this algorithm are not clearly
described in the preprint, and the implementation in vsearch differs from
that in usearch.
The former vsearch version did not always choose the most common taxonomic
entity over the 100 bootstraps among the database sequences with the
highest amount of word similarity to the query. Instead, if several
sequences had an equal similarity with the query, the sequence encountered
in the earliest bootstrap was chosen. The confidence level was calculated
based on this sequence compared to the selected sequences from the other 99
bootstraps. This could lead to a suboptimal choice with a low confidence.
In the new version, the most common of the sequences with the highest
amount of word similarity across the 100 bootstraps will be selected, and
ties will be broken randomly.
Another problem with the old implementation was that if several sequences
had the same amount of word similarity, the shortest one in the reference
database would be chosen, and if they were equally long, the earliest in
the database file would be chosen. A new option called sintax_random has
now been introduced. This option will randomly select one of the sequences
with the highest number of shared words with the query, without considering
their length or position. This avoids a bias towards shorter reference
sequences. This option is strongly recommended and will probably soon be
the default.
Furthermore, a ninth taxonomic rank, strain (letter t), is now recognized.
The speed of the sintax command has also been significantly improved at
least in some cases. Run vsearch with the randseed option and 1 thread to
ensure reproducibility of the random choices in the algorithm.
These changes are relevant for issues #210, #325, #498, and #535.
|
Thanks - I've sent you an email. |
I see two places where sintax would need a random seed:
Can setting the randseed option be applied to both of these random events to create a reproducible output?
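Going by the 2.28.1 release note in this thread, reproducible output needs the randseed option together with a single thread. A hedged sketch of how such a command line might be assembled from Python follows; the file names, the seed, and the 0.8 cutoff are placeholder assumptions, `sintax_command` is a hypothetical helper, and the flag spellings should be checked against the vsearch manual for the installed version.

```python
import shlex


def sintax_command(query, db, out, seed, cutoff=0.8):
    """Build a vsearch sintax argument list with a fixed seed and one thread."""
    return [
        "vsearch", "--sintax", query,
        "--db", db,
        "--tabbedout", out,
        "--sintax_cutoff", str(cutoff),  # placeholder cutoff value
        "--sintax_random",               # random choice among top hits
        "--randseed", str(seed),         # fixed seed for the random choices
        "--threads", "1",                # single thread for reproducibility
    ]


cmd = sintax_command("queries.fasta", "reference.udb", "sintax.tsv", seed=42)
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Building the command as a list rather than a string avoids shell quoting problems when file names contain spaces.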