Sintax taxonomy classifier #210

davidealbanese · 2016-10-26T07:50:40Z

Dear @torognes, are you planning to add the sintax classifier http://biorxiv.org/content/early/2016/09/09/074161?

Thank you,
Davide

torognes · 2016-10-26T07:57:30Z

Thanks for the suggestion. We might add some kind of taxonomic classifier to VSEARCH in the future, but there are no firm plans at the moment.

lanzen · 2017-10-16T11:28:36Z

Agree that this would be very useful! We have recently developed a SINTAX formatted version of the SilvaMod database based on Silva and part of CREST (https://github.com/lanzen/CREST/tree/master/LCAClassifier). Unfortunately, it cannot be used without a 64-bit license of usearch since it is too large.

GeoMicroSoares · 2017-11-12T21:45:29Z

I'll second this! Think of the opportunities now that Nanopore sequencing is booming.

chiras · 2017-12-15T12:43:51Z

@torognes any updates on this? It would be great to have a hierarchical classification procedure like utax/sintax.

torognes · 2017-12-19T10:21:55Z

I still agree that this would be very useful to include and one of the top features to prioritise, but I do not know when I can find time to implement it.

GeoMicroSoares · 2017-12-20T08:34:21Z

We'll be waiting or hopefully someone can contribute useful code meanwhile! :) Thanks @torognes

torognes · 2018-03-02T12:18:26Z

Due to popular demand, I have implemented the sintax command for taxonomic classification.

The sintax command has been added with --sintax_cutoff and --tabbedout options.

It implements the Sintax algorithm as described in Robert Edgar's preprint:

Robert Edgar (2016)
SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences
BioRxiv, 074161
doi: https://doi.org/10.1101/074161

Further details: https://www.drive5.com/usearch/manual/cmd_sintax.html

Multithreading is supported. Databases in UDB files are supported. Strand option may be specified.

This is a new feature that has been only very briefly tested. Feedback is therefore highly welcomed!
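
For anyone who wants to try this from a script, here is a minimal Python sketch that assembles a `vsearch --sintax` call using the options mentioned above. The file names, the 0.8 cutoff and the thread count are placeholder assumptions, not recommended values, and the helper names are made up for illustration.

```python
import subprocess


def build_sintax_command(query, db, out, cutoff=0.8, strand="plus", threads=4):
    """Assemble the argument list for a vsearch --sintax run.

    All default values are illustrative assumptions.
    """
    return [
        "vsearch",
        "--sintax", query,
        "--db", db,                      # FASTA or UDB reference database
        "--tabbedout", out,
        "--sintax_cutoff", str(cutoff),
        "--strand", strand,              # "plus" or "both"
        "--threads", str(threads),
    ]


def run_sintax(query, db, out, **options):
    """Run vsearch; raises CalledProcessError if vsearch exits with an error."""
    subprocess.run(build_sintax_command(query, db, out, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_sintax(...) once the files exist.
    print(" ".join(build_sintax_command("queries.fasta", "reference.udb", "sintax.tsv")))
```

Keeping the command as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.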

davidealbanese · 2018-03-02T12:43:20Z

Great news, thanks! I will test it soon...

Andreas-Bio · 2018-03-02T12:43:34Z

There are some issues with the (original) sintax command that prevent me (and potentially others) from using it:

- Self-testing is very inflexible: it can only test the whole database against itself using LOOCV, rather than running LOOCV on selected sequences only.
- The algorithm outputs only the first hit, irrespective of the hits after it. The hit may therefore be ambiguous (identical top hits) without this being apparent; a hit list like BLAST's would be much more transparent.
- The sintax algorithm is vulnerable to an uneven number of sequences per species. If species A has 15 sequences and species B has 1 sequence that is identical to those of species A, then querying the database with species B will almost always return species A and almost never species B.
- The sintax algorithm is inherently length-sensitive. This is unwelcome if the database contains many partial sequences, because a shorter sequence that is identical to part of a longer sequence will be discriminated against in the results.

If you are re-developing the sintax algorithm maybe some of these issues could be resolved very easily.
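
To make the "15 sequences vs. 1 sequence" point concrete, here is a toy Python sketch of a SINTAX-like bootstrap. It is not the vsearch or usearch implementation. It assumes 8-mers, 32 words sampled with replacement per iteration, 100 iterations, and (the key assumption) that ties between equally good top hits are broken uniformly at random; real implementations may break ties differently.

```python
import random
from collections import Counter


def kmers(seq, k=8):
    """Return the list of overlapping k-mers in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sintax_like_votes(query, references, k=8, n_words=32, iterations=100, seed=1):
    """Toy bootstrap over references, a list of (label, sequence) pairs.

    ASSUMPTION: ties between equally good top hits are broken uniformly
    at random. Returns {label: fraction of iterations won}.
    """
    rng = random.Random(seed)
    query_words = kmers(query, k)
    ref_words = [(label, set(kmers(seq, k))) for label, seq in references]
    votes = Counter()
    for _ in range(iterations):
        # Subsample words from the query, then score every reference.
        sample = [rng.choice(query_words) for _ in range(n_words)]
        scores = [sum(w in words for w in sample) for _, words in ref_words]
        best = max(scores)
        tied = [label for (label, _), s in zip(ref_words, scores) if s == best]
        votes[rng.choice(tied)] += 1
    return {label: count / iterations for label, count in votes.items()}


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATCGATTACGGCTAAGCTTGCA"
    # 15 identical references labelled A, one identical reference labelled B.
    refs = [("species_A", seq)] * 15 + [("species_B", seq)]
    print(sintax_like_votes(seq, refs))
```

Under these assumptions species A collects most of the votes even though the query is species B's own sequence, simply because A has more copies among the tied top hits.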

chiras · 2018-03-02T13:17:35Z

Thank you very much, absolutely appreciated!

lanzen · 2018-03-02T13:20:43Z

Many thanks, Torbjørn. This will clearly be an important resource for me, especially since the free version of SINTAX cannot even handle a database as large as the latest NR version of SILVA. Kind regards, Anders


torognes · 2018-03-08T15:23:46Z

Thanks for your feedback.

So far I have just tried to implement the SINTAX algorithm as described in the preprint. I understand that there is some disagreement about the quality of the algorithm and some issues have been raised. I will look more into these and see if it is possible to improve it or to implement a different algorithm. Please tell me if you have any specific ideas for improvement.

Andreas-Bio · 2018-03-21T14:48:33Z

Thanks for your reply.
I think it is important to have BLAST-like output ( https://www.drive5.com/usearch/manual/blast6out.html ) with a list of hits. This lets the user make an informed decision and also makes LOOCV much easier. It would also show ambiguous hits (identical k-mers between query and hit) immediately, rather than leaving the user to guess, which is the situation at the moment. It would also make it much easier to spot contaminating sequences (if the whole list is family x but one hit is family y).

Something like: 1) label_query 2) label_hit 3) length_query 4) length_hit 5) percent_similarity_kmers_query_and_hit 6) bootstrap value 7) number of kmers that are not identical ...?
Most of these numbers must be available internally, and if not, it should be possible to extract them with one line of code.
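
As a rough illustration of what such a row could contain, here is a small Python sketch that computes the k-mer columns for one query/hit pair. The row format and function names are hypothetical (this is not an existing vsearch output), and the bootstrap column is left out because it needs the full classifier.

```python
def kmer_set(seq, k=8):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hit_row(query_label, query, hit_label, hit, k=8):
    """Build one tab-separated row: labels, lengths, percent of query
    k-mers found in the hit, and number of query k-mers not found."""
    q_words = kmer_set(query, k)
    h_words = kmer_set(hit, k)
    shared = len(q_words & h_words)
    percent_shared = 100.0 * shared / len(q_words) if q_words else 0.0
    not_shared = len(q_words) - shared
    fields = [query_label, hit_label, len(query), len(hit),
              f"{percent_shared:.1f}", not_shared]
    return "\t".join(str(f) for f in fields)


if __name__ == "__main__":
    print(hit_row("query1", "ACGTACGTACGGTTAACC", "ref1", "ACGTACGTACGGTTAACCGG"))
```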

Leo-alves · 2018-10-31T19:30:26Z

Hey there.
Just to drop my five cents...
I have been working on improving taxonomic classification using vsearch (not the sintax algorithm itself) and I am stuck on the same issues that andzandz11 mentioned.
Deep-level classifications, species in my case, are often simply impossible to define, partly because of the 16S variable regions we sequence. I am working with a highly curated database that I am relatively confident has no high level of misannotation. Still, for a lot of sequence variants I came across dubious taxonomies, such as the following:

One SV was previously assigned as Bacillus anthracis by vsearch with --maxaccepts 1 (disclaimer: at that point I ran vsearch as implemented in the QIIME 2 workflow). I then reran it against the same db with vsearch outside QIIME:
$ vsearch --usearch_global sequence_variant.fa --db db --id 0.99 --blast6out out --maxaccepts 50000
From that I got 811 hits with >=99% identity. Then I ranked the taxonomies by their percentage of those 811 hits.
Taxonomy | Percentage of hits
D_5__Bacillus;D_6__Bacillus_cereus | 36.7
D_5__Bacillus;D_6__Bacillus_sp. | 35.4
D_5__Bacillus;D_6__Bacillus_thuringiensis | 12.1
D_5__Bacillus;D_6__Bacillus_anthracis | 5.9
D_5__Bacillus;D_6__Bacillus_mycoides | 3.6
D_5__Bacillus;D_6__Bacillus_subtilis | 1.1
D_5__Bacillus;D_6__Bacillus_pseudomycoides | 1.0
D_5__Streptococcus;D_6__Streptococcus_pneumoniae | 0.9
D_5__Bacillus;D_6__Bacillus_toyonensis | 0.7
D_5__unclassified_Bacillaceae;D_6__unclassified_Bacillaceae | 0.6
D_5__Enterobacter;D_6__Enterobacter_cloacae | 0.5
D_5__Brevibacillus;D_6__Brevibacillus_brevis | 0.2
D_5__Bacillus;D_6__Bacillus_samanii | 0.2
D_5__Bacillus;D_6__Bacillus_gaemokensis | 0.2
D_5__Staphylococcus;D_6__Staphylococcus_sp. | 0.1
D_5__Staphylococcus;D_6__Staphylococcus_aureus | 0.1
D_5__Paenibacillus;D_6__Paenibacillus_sp. | 0.1
D_5__Bacillus;D_6__Bacillus_pumilus | 0.1
D_5__Bacillus;D_6__Bacillus_marcorestinctum | 0.1
D_5__Bacillus;D_6__Bacillus_amyloliquefaciens | 0.1
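
A ranking like the one above can be produced from the blast6out file with a few lines of Python. This is a sketch, not the script actually used: it assumes a dictionary `taxonomy_of` that maps each target label (column 2 of blast6out) to its taxonomy string, which would have to be built from the database annotation.

```python
import csv
from collections import Counter


def rank_taxonomies(blast6_lines, taxonomy_of):
    """Return [(taxonomy, percent_of_hits)], most frequent first.

    blast6_lines: iterable of tab-separated blast6out rows.
    taxonomy_of: dict mapping target labels to taxonomy strings (assumed).
    """
    counts = Counter()
    for row in csv.reader(blast6_lines, delimiter="\t"):
        if len(row) < 2:
            continue                     # skip empty or malformed lines
        counts[taxonomy_of.get(row[1], "unknown")] += 1
    total = sum(counts.values())
    return [(tax, 100.0 * n / total) for tax, n in counts.most_common()]


if __name__ == "__main__":
    # Tiny made-up example; real rows have 12 columns.
    lines = ["sv1\tref1\t99.6", "sv1\tref2\t99.2", "sv1\tref3\t99.1"]
    taxonomy = {"ref1": "Bacillus_cereus", "ref2": "Bacillus_cereus",
                "ref3": "Bacillus_anthracis"}
    for tax, pct in rank_taxonomies(lines, taxonomy):
        print(f"{tax} | {pct:.1f}")
```

With a real file, pass the open file handle as `blast6_lines`.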

It turns out that what was called B. anthracis looks more like B. cereus. However, a few species are so similar to B. cereus that together they form the B. cereus group. Summing the percentages of all B. cereus group members, I got 53% B. cereus group. Then there is the number of sequences annotated with each species in the database. If one taxonomy is over-represented (exactly as he mentioned), the ranking tends to drift towards that assignment. That is partly what happens in this example, because I have 431 B. cereus vs. 73 B. anthracis sequences in the db. I say partly because 36.7% of 881 = 323 and 5.9% of 881 = 52, but 323 is 75% of 431 while 52 is 71% of 73. I actually see a trend there: the percentages of database sequences hit for B. cereus, B. thuringiensis, B. anthracis and B. mycoides are 75, 73, 71 and 68%, respectively. In the end I would consider this sequence as “B. cereus group” and not B. thuringiensis.
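
The arithmetic in the paragraph above can be reproduced with a short Python sketch. It uses the numbers exactly as quoted there (881 hits in total, 431 and 73 database sequences); the function name is made up for illustration.

```python
def recovered_fraction(percent_of_hits, total_hits, n_in_db):
    """Convert a share of hits into a hit count, then express that count
    as a percentage of the species' sequences in the database."""
    hits = round(percent_of_hits / 100.0 * total_hits)
    return hits, 100.0 * hits / n_in_db


if __name__ == "__main__":
    for species, pct, n_db in [("B. cereus", 36.7, 431), ("B. anthracis", 5.9, 73)]:
        hits, recovered = recovered_fraction(pct, 881, n_db)
        print(f"{species}: {hits} hits, {recovered:.0f}% of {n_db} database sequences")
```

Dividing by the number of database sequences per species is what shows that the two species are recovered at similar rates, even though their raw shares of the hit list differ a lot.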
I also noticed that the top hits in blast6out output are reported in the order of their positions in the database. That makes sense, as vsearch finds an alignment and outputs it. However, for a sequence similar to more than one species (but not ultra-similar like the B. cereus group), this means that the first one present in the db will likely be assigned. Because of that, a classification system that relies on the top hit fails, at least in my case.
So, after ranking, I identify the groups of species that I know are closely related and put them together, like the B. cereus group mentioned above. But if a member of the group (excluding B. cereus) is ranked highly, e.g. 60% or more, I accept that taxonomy.
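
One possible reading of this grouping rule, sketched in Python. The group membership, the set of "representative" species that never win on their own, and the 60% threshold are all assumptions taken from the description above, not a tested procedure.

```python
from collections import defaultdict


def assign(ranked, groups, representatives=frozenset(), threshold=60.0):
    """Pick a label from ranked = [(species, percent_of_hits)].

    groups: dict mapping species to a group name (assumed, hand-curated).
    representatives: group members that are never accepted on their own
    (e.g. B. cereus for the B. cereus group).
    """
    if not ranked:
        return None, 0.0
    # Rule 1: a non-representative group member above the threshold wins.
    for species, pct in ranked:
        if species in groups and species not in representatives and pct >= threshold:
            return species, pct
    # Rule 2: otherwise sum percentages per group and take the largest.
    collapsed = defaultdict(float)
    for species, pct in ranked:
        collapsed[groups.get(species, species)] += pct
    best = max(collapsed, key=collapsed.get)
    return best, collapsed[best]


if __name__ == "__main__":
    groups = {name: "B. cereus group" for name in
              ("Bacillus_cereus", "Bacillus_thuringiensis", "Bacillus_anthracis")}
    ranked = [("Bacillus_cereus", 36.7), ("Bacillus_sp.", 35.4),
              ("Bacillus_thuringiensis", 12.1), ("Bacillus_anthracis", 5.9)]
    print(assign(ranked, groups, representatives={"Bacillus_cereus"}))
```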
Another serious problem is that many samples sequenced from an isolated bacterium whose species is known to us still return dubious taxonomies, like 50-50% taxA-taxB, with taxA being the correct one, and for those I just don't know what to do.

torognes · 2024-04-26T13:35:44Z

I have made several improvements to the sintax command in vsearch 2.28.1, just released. Please see issue #535 or the release notes for details.

torognes added the enhancement label Oct 26, 2016

torognes referenced this issue Mar 2, 2018: Added sintax command (97362d7)

diegomic mentioned this issue Jul 26, 2018: sintax classifier and multiple identical best hits #325 (Open)

torognes mentioned this issue Apr 26, 2024: control of 2 separate randseed events in sintax #535 (Open)
