interpretation of --maxaccepts vsearch #569
Comments
Hi,
Thank you for your questions. I'll try to clarify.
During clustering and many other tasks, vsearch performs heuristic searches to find similar sequences. This is done, as you describe, by first counting the k-mers (8-mers by default) shared between the query and each target sequence. The target sequences are then sorted by decreasing number of shared k-mers, and the sequence with the highest number of shared k-mers is considered first. If this sequence has the required amount of similarity with the query sequence in terms of percentage identity (e.g. 97%) or other requirements (depending on the options used), it is "accepted". If it does not satisfy the requirements, it is "rejected". If the
What happens if more than one target sequence is accepted? When clustering, the default is to sort the accepted sequences by sequence similarity and choose the target sequence, i.e. centroid, that has the highest similarity. The query sequence is then placed in that cluster. Alternatively, if the
When searching, not clustering, one or more of the target sequences may be reported as hits for the query, depending on the
I agree that the documentation could be clearer regarding this issue. We will try to improve it for the next release.
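The accept/reject procedure described above can be sketched roughly as follows. This is a simplified illustration and not vsearch's actual implementation: the `Target` record, the function names, and the example values are hypothetical, shared k-mer counts and identities are given as precomputed inputs, and the `maxrejects` stopping rule is an assumption loosely modeled on vsearch's `--maxrejects` option.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    """Hypothetical record for one target sequence (e.g. a cluster centroid)."""
    name: str
    shared_kmers: int  # k-mers in common with the query (precomputed here)
    identity: float    # pairwise identity with the query, as a fraction


def search(targets, min_identity=0.97, maxaccepts=1, maxrejects=32):
    """Return the accepted targets, examined in order of decreasing shared k-mers."""
    ranked = sorted(targets, key=lambda t: t.shared_kmers, reverse=True)
    accepted, rejects = [], 0
    for target in ranked:
        if target.identity >= min_identity:
            accepted.append(target)
            if len(accepted) >= maxaccepts:
                break  # enough accepted hits: stop searching for this query
        else:
            rejects += 1
            if rejects >= maxrejects:
                break  # assumed stopping rule: too many rejected candidates
    return accepted


def assign_to_cluster(targets, **options) -> Optional[str]:
    """Pick the accepted target with the highest identity, or None if none pass."""
    accepted = search(targets, **options)
    if not accepted:
        return None
    return max(accepted, key=lambda t: t.identity).name


# Toy data: centroid_A shares the most k-mers, but centroid_B is more similar.
targets = [
    Target("centroid_A", shared_kmers=120, identity=0.975),
    Target("centroid_B", shared_kmers=118, identity=0.990),
    Target("centroid_C", shared_kmers=90, identity=0.900),
]
```

With these toy values, `assign_to_cluster(targets, maxaccepts=1)` stops at `centroid_A`, the first candidate that passes, while `maxaccepts=2` also examines `centroid_B` and selects it because its identity is higher. The query is still assigned to a single cluster in both cases.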
Hi @torognes,
Reopening the issue to remember to update the documentation.
Hello,
I've got a question regarding the argument --maxaccepts of the vsearch command --cluster_fast:
The manpage states the following about maxaccepts:
"The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If --maxaccepts is set to a higher value, more hits are accepted"
What exactly is meant by "If --maxaccepts is set to a higher value, more hits are accepted"?
What will happen when another hit is accepted?
I guess the target sequences are the centroids or seed sequences of the clusters in this case? So these clusters (i.e. target sequences) are sorted by the number of k-mers they have in common with the query, which will likely resemble pairwise sequence similarity.
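To make the "k-mers in common" ranking concrete, here is a minimal sketch. It assumes that distinct k-mers are counted, which may differ from vsearch's exact counting scheme; the function names and the toy sequences are hypothetical, and k is set to 3 in the example only so the short sequences produce readable counts.

```python
def kmers(seq, k=8):
    """Return the set of distinct k-mers (substrings of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_count(query, target, k=8):
    """Count distinct k-mers that occur in both sequences."""
    return len(kmers(query, k) & kmers(target, k))


def rank_targets(query, targets, k=8):
    """Sort (name, shared k-mer count) pairs by decreasing count."""
    counts = {name: shared_kmer_count(query, seq, k) for name, seq in targets.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Toy example with k=3: t1 shares ACG, CGT and GTA with the query; t2 shares only TAC.
ranking = rank_targets("ACGTAC", {"t2": "TTTTAC", "t1": "ACGTAA"}, k=3)
print(ranking)  # [('t1', 3), ('t2', 1)]
```

The ranking only decides the order in which candidates are aligned. Whether a candidate is accepted still depends on the pairwise alignment, so the top-ranked target is not guaranteed to be the most similar one.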
I can understand that if --maxaccepts 1 (default) is specified, vsearch goes through these pairwise alignments and selects the first one that matches the criteria (e.g. 97% similarity). Then the query sequence is placed in that cluster(?)
But if e.g. --maxaccepts 2 is specified, can the query sequence be accepted in two clusters? Or how does this work?
I can imagine that the first alignment that matches the criterion is not the best one, so that you would preferably check multiple accepted target sequences and select the best one among them (i.e. place your query sequence in the cluster that matches best). Is that what --maxaccepts is about? In that case, I would expect a description like: "If --maxaccepts is set to a higher value, more hits are accepted and the best matching target sequence is finally selected as hit" or something like that.