interpretation of --maxaccepts vsearch #569

Robvh-git · 2024-08-05T13:21:43Z

Hello,

I've got a question regarding the argument --maxaccepts of the vsearch command --cluster_fast:

The manpage states the following about maxaccepts:

"The search process sorts target sequences
by decreasing number of k-mers they have in common with the query sequence, using
that information as a proxy for sequence similarity. After pairwise alignments, if the
first target sequence passes the acceptation criteria, it is accepted as best hit and the
search process stops for that query. If --maxaccepts is set to a higher value, more hits
are accepted"

What exactly is meant by "If --maxaccepts is set to a higher value, more hits are accepted"?

What will happen when another hit is accepted?

I guess the target sequences are the centroids or seed sequences of the clusters in this case?

So these clusters (i.e. target sequences) are sorted by the number of k-mers they have in common with the query, which will likely reflect pairwise sequence similarity.

I can understand that if --maxaccepts 1 (default) is specified, vsearch goes through these pairwise alignments and selects the first one that meets the criteria (e.g. 97% similarity). Then the query sequence is placed in that cluster(?)

But if e.g. --maxaccepts 2 is specified, the query sequence can be accepted in two clusters? Or how does this work?

I can imagine that the first alignment that meets the criterion is not necessarily the best one, so you would preferably check multiple accepted target sequences and select the best of them (i.e. place your query sequence in the cluster that matches best). Is that what --maxaccepts is about? In that case, I would expect a description like: "If --maxaccepts is set to a higher value, more hits are accepted and the best matching target sequence is finally selected as hit" or something like that.


torognes · 2024-08-06T10:08:52Z

Hi

Thank you for your questions. I'll try to clarify.

During clustering and many other tasks, vsearch will perform heuristic searches to find similar sequences. This is done, as you describe, by first considering the number of shared k-mers (8-mers by default) between the query and each target sequence. The target sequences are then sorted by decreasing number of shared k-mers. The sequence with the highest number of shared k-mers is considered first. If this sequence has the required amount of similarity with the query sequence in terms of percentage identity (e.g. 97%) or other requirements (depending on options used), it is "accepted". If it does not satisfy the requirements, it is "rejected". If the --maxaccepts option is set to a value higher than 1 (the default), the next target sequence, with the next highest number of shared k-mers, will also be considered. If this sequence also meets the requirements (e.g. 97% identity), it will also be accepted. In this way more than one sequence may be accepted. When the maximum number of accepted sequences (option --maxaccepts, default 1) or rejected sequences (option --maxrejects, default 32) is reached, vsearch will stop considering more target sequences for this query.
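To make the accept/reject loop described above concrete, here is a minimal Python sketch. It is not vsearch's actual code: the function names, the use of distinct k-mers, and the position-by-position `identity` function (a toy stand-in for a real pairwise alignment) are all assumptions made for illustration.

```python
def shared_kmers(a, b, k=8):
    """Count distinct k-mers present in both sequences (assumed counting scheme)."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def identity(a, b):
    """Toy stand-in for alignment identity: fraction of matching positions."""
    length = max(len(a), len(b))
    matches = sum(x == y for x, y in zip(a, b))
    return matches / length if length else 0.0


def search(query, targets, min_id=0.97, maxaccepts=1, maxrejects=32, k=8):
    """Return the accepted (target, identity) pairs for one query."""
    # Rank targets by decreasing number of k-mers shared with the query.
    ranked = sorted(targets, key=lambda t: shared_kmers(query, t, k), reverse=True)
    accepted, rejects = [], 0
    for target in ranked:
        # Stop as soon as either limit has been reached.
        if len(accepted) >= maxaccepts or rejects >= maxrejects:
            break
        ident = identity(query, target)
        if ident >= min_id:
            accepted.append((target, ident))
        else:
            rejects += 1
    return accepted


# Hypothetical toy data (k=4 and a 90% threshold keep the example short).
query = "ACGTTGCAAGGCTTAACCGT"
targets = ["ACGTTGCAAGGCTTAACCGA", "TTTTTTTTTTTTTTTTTTTT", "ACGTTGCAAGGCTTAACCGT"]
one_hit = search(query, targets, min_id=0.9, maxaccepts=1, k=4)
two_hits = search(query, targets, min_id=0.9, maxaccepts=2, k=4)
```

With `maxaccepts=1` the loop stops at the first accepted target; with `maxaccepts=2` it continues down the ranked list and collects a second accepted target.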

What happens if more than one target sequence is accepted? When clustering, the default is to sort the accepted sequences by sequence similarity and choose the target sequence, i.e. centroid, that has the highest similarity. The query sequence is then placed in that cluster. Alternatively, if the --sizeorder option is specified, the accepted centroids will be sorted by abundance, and the centroid with the highest abundance will be chosen.
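The choice among several accepted centroids can be sketched as follows. This is an illustrative sketch only: the `Hit` class, the `choose_centroid` function, and the tie-breaking behaviour (the first maximal hit wins) are assumptions, not vsearch internals.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One accepted centroid for a query (hypothetical structure)."""
    centroid: str
    identity: float
    abundance: int


def choose_centroid(accepted, sizeorder=False):
    """Pick the cluster for a query among its accepted centroids.

    Default: highest identity. With sizeorder=True (mirroring the
    --sizeorder option): highest abundance.
    """
    if not accepted:
        return None  # no accepted centroid; the query would seed a new cluster
    if sizeorder:
        return max(accepted, key=lambda h: h.abundance)
    return max(accepted, key=lambda h: h.identity)


# Hypothetical hits: c1 was accepted first, but c2 is the closer match.
hits = [Hit("c1", 0.975, 10), Hit("c2", 0.99, 3)]
best_by_identity = choose_centroid(hits)
best_by_abundance = choose_centroid(hits, sizeorder=True)
```

Here the default picks `c2` (higher identity), while `sizeorder=True` picks `c1` (higher abundance), which is why either setting only matters when more than one centroid can be accepted.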

When searching, not clustering, one or more of the target sequences may be reported as hits for the query, depending on the --maxhits and --top_hits_only options.

I agree that the documentation could be clearer regarding this issue. We will try to improve it for the next release.

Robvh-git · 2024-08-09T08:36:43Z

Hi @torognes,
Thank you for the detailed answer; it is completely clear now.
I think it would indeed be helpful to add this information to the documentation.

torognes · 2024-08-09T13:38:12Z

Reopening the issue to remember to update the documentation.

torognes self-assigned this Aug 6, 2024

torognes added question Documentation labels Aug 6, 2024

Robvh-git closed this as completed Aug 9, 2024

torognes reopened this Aug 9, 2024
