Clustering
There is a small difference between the --cluster_fast and --cluster_smallmem commands. --cluster_fast sorts your sequences by length, longest first. If two sequences are equally long, they are sorted alphabetically by sequence label (the identifier in the fasta header) as a second key. --cluster_smallmem does not sort the sequences, but it checks that they are in length-sorted order, unless you specify the --usersort option, in which case it does not check. This behaviour is similar to usearch, as we try to make vsearch behave like usearch in most cases.
In short, --cluster_fast always sorts the sequences by decreasing length, while --cluster_smallmem expects the sequences to be already sorted by decreasing length, or by another criterion if --usersort is used.
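The ordering rules above can be sketched in a few lines of Python. This is only an illustration, not vsearch code: the helper names sort_for_cluster_fast and is_length_sorted are hypothetical, records are assumed to be (label, sequence) pairs, and Python's default string comparison stands in for "alphabetical" order, which may differ from vsearch in edge cases.

```python
def sort_for_cluster_fast(records):
    """Sort (label, sequence) pairs by decreasing length, then by label.

    Hypothetical helper mimicking the ordering described for --cluster_fast.
    """
    return sorted(records, key=lambda rec: (-len(rec[1]), rec[0]))


def is_length_sorted(records):
    """Return True if the sequences are in order of non-increasing length.

    Hypothetical helper mimicking the check described for --cluster_smallmem
    when --usersort is not given.
    """
    lengths = [len(seq) for _, seq in records]
    return all(a >= b for a, b in zip(lengths, lengths[1:]))


records = [("s3", "ACGT"), ("s1", "ACGTACGT"), ("s2", "ACGT"), ("s0", "AC")]
sorted_records = sort_for_cluster_fast(records)

print([label for label, _ in sorted_records])  # ['s1', 's2', 's3', 's0']
print(is_length_sorted(records))               # False
print(is_length_sorted(sorted_records))        # True
```

Here s2 and s3 have the same length, so the label decides their relative order.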
The names "fast" and "smallmem" are a bit misleading as both commands are usually equally fast and memory hungry. Clustering results may be different if the sequences are processed in a different order, as it affects which sequences are used as centroids.
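To see why the processing order matters, consider a toy greedy centroid clustering. This is a deliberately simplified sketch and not the algorithm vsearch implements: the functions identity and greedy_cluster are hypothetical, identity is a plain position-by-position comparison of equal-length strings, and the 0.75 threshold is chosen only to make the effect visible.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Toy greedy clustering: each sequence joins the first centroid it
    matches at or above the threshold, otherwise it becomes a new centroid."""
    clusters = {}  # centroid -> list of members (centroid included)
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No existing centroid was close enough.
            clusters[seq] = [seq]
    return clusters


x, y, z = "AAAAAAAA", "AAAAAACC", "AAAACCCC"

# Same sequences, two processing orders, different centroids.
print(list(greedy_cluster([x, y, z], 0.75)))  # ['AAAAAAAA', 'AAAACCCC']
print(list(greedy_cluster([y, x, z], 0.75)))  # ['AAAAAACC']
```

When x comes first, z is too distant from it and starts its own cluster. When y comes first, both x and z are close enough to y, so a single cluster results.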
When using the options --msaout and --consout, a consensus sequence can be empty (header, but no nucleotides) under certain conditions. This is a consequence of the definition of the consensus: the consensus symbol in a column is a gap symbol (and is removed) if at least half the sequences contain a gap in that column. Otherwise the consensus symbol is the most common symbol of A, C, G or T in that column. If two symbols are equally common, the first of those symbols in alphabetical order is chosen.
That definition can trigger an issue when aligning a long centroid sequence and several shorter sequences that rarely overlap each other. The most common symbol will then be a gap symbol all over the alignment. This leads to a consensus containing only gap symbols, which will subsequently be removed, resulting in an empty sequence.
LONGSEQUENCE
LON---------
------QUE---
---GSE------
---------NCE
------------ empty consensus
Note that consensus sequences are written by both the --consout and --msaout options. Gaps are removed from the consensus sequences written by --consout but remain in the consensus sequences written by --msaout.
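The consensus rule and the empty-consensus case described above can be reproduced with a short Python sketch. It is an illustration of the stated definition only, not vsearch's implementation: the function name consensus and its keep_gaps parameter are hypothetical, and the function accepts any non-gap symbol (not just A, C, G and T) so that the LONGSEQUENCE example can be used as input.

```python
from collections import Counter


def consensus(rows, keep_gaps=False):
    """Column-wise consensus of equal-length aligned rows.

    A column yields a gap if at least half of the rows have a gap there.
    Otherwise it yields the most common non-gap symbol, with ties broken
    by alphabetical order. Gaps are stripped unless keep_gaps is True.
    """
    out = []
    for column in zip(*rows):
        gaps = column.count("-")
        if 2 * gaps >= len(column):
            out.append("-")
            continue
        counts = Counter(sym for sym in column if sym != "-")
        best = max(counts.values())
        out.append(min(sym for sym, n in counts.items() if n == best))
    joined = "".join(out)
    return joined if keep_gaps else joined.replace("-", "")


alignment = [
    "LONGSEQUENCE",
    "LON---------",
    "------QUE---",
    "---GSE------",
    "---------NCE",
]

# Every column has 3 gaps out of 5 rows, so every consensus symbol is a gap.
print(repr(consensus(alignment, keep_gaps=True)))  # '------------'
print(repr(consensus(alignment)))                  # ''

# A denser alignment gives a non-empty consensus.
print(consensus(["ACGT", "ACG-", "AGGT"]))         # ACGT
```

The keep_gaps=True call corresponds to the gapped consensus described for --msaout, and the default call to the gap-free consensus described for --consout.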