Sintax taxonomy classifier #210

Open
davidealbanese opened this issue Oct 26, 2016 · 15 comments

Comments

@davidealbanese
Contributor

Dear @torognes, are you planning to add the sintax classifier http://biorxiv.org/content/early/2016/09/09/074161?

Thank you,
Davide

@torognes
Owner

Thanks for the suggestion. We might add some kind of taxonomic classifier to VSEARCH in the future, but there are no firm plans at the moment.

@lanzen

lanzen commented Oct 16, 2017

Agree that this would be very useful! We have recently developed a SINTAX formatted version of the SilvaMod database based on Silva and part of CREST (https://github.com/lanzen/CREST/tree/master/LCAClassifier). Unfortunately, it cannot be used without a 64-bit license of usearch since it is too large.

@GeoMicroSoares

I'll second this! Think of the opportunities now that Nanopore sequencing is booming.

@chiras

chiras commented Dec 15, 2017

@torognes, any updates on this? It would be great to have a hierarchical classification procedure similar to utax/sintax.

@torognes
Owner

I still agree that this would be very useful to include and one of the top features to prioritise, but I do not know when I can find time to implement it.

@GeoMicroSoares

We'll be waiting or hopefully someone can contribute useful code meanwhile! :) Thanks @torognes

@torognes
Owner

torognes commented Mar 2, 2018

Due to popular demand, I have implemented the sintax command for taxonomic classification.

The sintax command has been added with --sintax_cutoff and --tabbedout options.

It implements the Sintax algorithm as described in Robert Edgar's preprint:

Robert Edgar (2016)
SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences
BioRxiv, 074161
doi: https://doi.org/10.1101/074161

Further details: https://www.drive5.com/usearch/manual/cmd_sintax.html

Multithreading is supported. Databases in UDB files are supported. Strand option may be specified.

This is a new feature that has been only very briefly tested. Feedback is therefore highly welcomed!
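
For readers unfamiliar with the method, the bootstrap idea in the preprint can be sketched roughly as below. This is a simplified Python illustration, not the vsearch or usearch code. The word length of 8, the 32 words per subsample and the 100 iterations are values assumed from the preprint, the function names are hypothetical, and the sketch votes on the whole taxonomy string rather than rank by rank. Ties go to the first reference in database order, which is a property of this sketch only.

```python
import random
from collections import Counter

K = 8             # word (k-mer) length, assumed from the preprint
SUBSAMPLE = 32    # words drawn per bootstrap iteration, assumed
ITERATIONS = 100  # number of bootstrap iterations, assumed


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sintax_sketch(query, references, rng):
    """Hypothetical helper: references is a list of (taxonomy, sequence).

    Returns (taxonomy, confidence), where confidence is the fraction of
    iterations in which that taxonomy belonged to the top hit.
    """
    query_words = sorted(kmers(query))
    ref_words = [(tax, kmers(seq)) for tax, seq in references]
    votes = Counter()
    for _ in range(ITERATIONS):
        # Draw words from the query with replacement.
        sample = [rng.choice(query_words) for _ in range(SUBSAMPLE)]
        # Top hit: reference sharing the most sampled words.
        # max() keeps the first reference on ties (database order).
        best_tax, _ = max(ref_words, key=lambda r: sum(w in r[1] for w in sample))
        votes[best_tax] += 1
    tax, count = votes.most_common(1)[0]
    return tax, count / ITERATIONS
```

Because a single top hit is recorded per iteration, a reference that merely ties with another one never receives votes in this sketch.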

@davidealbanese
Contributor Author

Great news, thanks! I will test it soon...

@Andreas-Bio

Andreas-Bio commented Mar 2, 2018

There are some issues with the (original) sintax command that prevent me (and potentially others) from using it:

- The self-testing is very inflexible: it can only test the whole database against itself using LOOCV (leave-one-out cross-validation), rather than running LOOCV on selected sequences only.
- The algorithm only outputs the first hit, irrespective of the hits after it, so an ambiguous hit (identical top hits) may go unnoticed. A hit list like BLAST's would be much more transparent.
- The sintax algorithm is vulnerable to an uneven number of sequences per species. If species A has 15 sequences and species B has 1 sequence that is identical to a species A sequence, then species A will be reported and almost never species B when the database is queried with the species B sequence.
- The sintax algorithm is inherently length-sensitive. This is unwelcome if there are many partial sequences in the database, because a shorter sequence that is identical to part of a longer sequence will be discriminated against in the results.

If you are re-developing the sintax algorithm, maybe some of these issues could be resolved very easily.

@chiras

chiras commented Mar 2, 2018

Thank you very much, absolutely appreciated!

@lanzen

lanzen commented Mar 2, 2018 via email

@torognes
Owner

torognes commented Mar 8, 2018

Thanks for your feedback.

So far I have just tried to implement the SINTAX algorithm as described in the preprint. I understand that there is some disagreement about the quality of the algorithm and some issues have been raised. I will look more into these and see if it is possible to improve it or to implement a different algorithm. Please tell me if you have any specific ideas for improvement.

@Andreas-Bio

Thanks for your reply.
I think it is important to have an output like BLAST's (https://www.drive5.com/usearch/manual/blast6out.html), with a list of hits. This lets the user make an informed decision and makes LOOCV much easier. It would also show ambiguous hits (identical k-mers between query and hit) immediately, rather than leaving the user to guess, as is the case at the moment. And it would make it much easier to spot contaminating sequences (for example, if the whole list is family x but one hit is family y).

Something like: 1) label_query 2) label_hit 3) length_query 4) length_hit 5) percent_similarity_kmers_query_and_hit 6) bootstrap value 7) number of kmers that are not identical ...?
Most of these numbers must be available internally, and if not, it should be possible to extract them with one line of code.

@Leo-alves

Hey there.
Just to drop my five cents...
I have been working on improving taxonomic classification using vsearch (not the sintax algorithm specifically), and I am stuck on the same issues that andzandz11 mentioned.
Deep-level classification (species, in my case) is often simply impossible, partly because of the 16S variable regions we sequence. I am working with a highly curated database, which I am relatively confident has no high level of misannotation. Still, for many sequence variants I come across dubious taxonomies, such as the following:

An SV was previously assigned as Bacillus anthracis by vsearch with --maxaccepts 1 (disclaimer: at that point I ran vsearch as implemented in the QIIME 2 workflow). I then reran it against the same db with vsearch outside QIIME:
$ vsearch --usearch_global sequence_variant.fa --db db --id 0.99 --blast6out out --maxaccepts 50000
From that I got 811 hits with >=99% identity. Then I ranked the taxonomies by their percentage of those 811 hits.
Taxonomy | percentage
D_5__Bacillus;D_6__Bacillus_cereus | 36,7
D_5__Bacillus;D_6__Bacillus_sp. | 35,4
D_5__Bacillus;D_6__Bacillus_thuringiensis | 12,1
D_5__Bacillus;D_6__Bacillus_anthracis | 5,9
D_5__Bacillus;D_6__Bacillus_mycoides | 3,6
D_5__Bacillus;D_6__Bacillus_subtilis | 1,1
D_5__Bacillus;D_6__Bacillus_pseudomycoides | 1,0
D_5__Streptococcus;D_6__Streptococcus_pneumoniae | 0,9
D_5__Bacillus;D_6__Bacillus_toyonensis | 0,7
D_5__unclassified_Bacillaceae;D_6__unclassified_Bacillaceae | 0,6
D_5__Enterobacter;D_6__Enterobacter_cloacae | 0,5
D_5__Brevibacillus;D_6__Brevibacillus_brevis | 0,2
D_5__Bacillus;D_6__Bacillus_samanii | 0,2
D_5__Bacillus;D_6__Bacillus_gaemokensis | 0,2
D_5__Staphylococcus;D_6__Staphylococcus_sp. | 0,1
D_5__Staphylococcus;D_6__Staphylococcus_aureus | 0,1
D_5__Paenibacillus;D_6__Paenibacillus_sp. | 0,1
D_5__Bacillus;D_6__Bacillus_pumilus | 0,1
D_5__Bacillus;D_6__Bacillus_marcorestinctum | 0,1
D_5__Bacillus;D_6__Bacillus_amyloliquefaciens | 0,1
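
A ranking like the one above can be produced along these lines. This is only a sketch: it assumes the standard blast6out layout with the target label in the second tab-separated column, and a hypothetical `taxonomy_of` dictionary mapping each database label to its taxonomy string. How that mapping is loaded depends on the database and is not shown.

```python
from collections import Counter


def rank_taxonomies(blast6_lines, taxonomy_of):
    """Rank taxonomies by their share of hits.

    blast6_lines: iterable of tab-separated blast6out lines.
    taxonomy_of:  hypothetical dict, target label -> taxonomy string.
    Returns a list of (taxonomy, percentage), most frequent first.
    """
    counts = Counter()
    for line in blast6_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        target = fields[1]  # column 2 of blast6out is the target label
        counts[taxonomy_of.get(target, "unassigned")] += 1
    total = sum(counts.values())
    return [(tax, 100.0 * n / total) for tax, n in counts.most_common()]
```

With a file on disk, the call could look like `rank_taxonomies(open("out"), taxonomy_of)`, where `out` is the file written by `--blast6out`.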

It turns out that what was called B. anthracis looks more like B. cereus. However, one should consider that a few species are so similar to B. cereus that we speak of the B. cereus group. Summing the percentages of all B. cereus group members, I got 53% B. cereus group. Then there is the number of sequences for each species in the database. If a taxonomy is overrepresented (exactly as he mentioned), the ranking tends to shift toward that assignment. That is partly happening in this example, because I have 431 B. cereus vs. 73 B. anthracis seqs in the db. I say partly because 36.7% of 881 = 323 and 5.9% = 52, but 323 is 75% of 431 while 52 is 71% of 73. I actually see a trend there: the percentages of database sequences hit for B. cereus, B. thuringiensis, B. anthracis and B. mycoides are 75-73-71-68%. In the end I would consider this sequence "B. cereus group" and not B. thuringiensis.
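
The comparison above amounts to dividing the hit count for each species by the number of sequences that species has in the database. A small sketch using only the two counts quoted above (the function name is hypothetical):

```python
def recovery_fraction(hits, db_sequences):
    """Fraction of each species' database sequences found among the hits."""
    return {
        species: hits[species] / db_sequences[species]
        for species in hits
        if db_sequences.get(species)  # skip species missing from the db table
    }


# Counts quoted in the comment above.
hits = {"Bacillus_cereus": 323, "Bacillus_anthracis": 52}
db_sequences = {"Bacillus_cereus": 431, "Bacillus_anthracis": 73}

fractions = recovery_fraction(hits, db_sequences)
for species, fraction in fractions.items():
    print(f"{species}: {fraction:.0%}")
```

Seen this way, the raw ranking mostly reflects how many sequences each species has in the database, while the normalized fractions are close to each other.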
I also noticed that the top hits in blast6out output are reported in the order of their positions in the database. That makes sense, as vsearch finds an alignment and outputs it. However, for a sequence similar to more than one species (but not ultra-similar like the B. cereus group), that means the first one present in the db will likely be assigned. Because of that, a classification system that relies on the top hit fails, at least in my case.
So, what I do after ranking is detect the groups of species I know are closely related and put them together, like the B. cereus group mentioned above. But if a member of the group (excluding B. cereus) is highly ranked, e.g. 60%+, I accept that taxonomy.
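
The rule just described could be sketched as follows. The group membership, the group label and the function name are illustrative assumptions rather than a curated list, and the 60% threshold is the example value given above.

```python
# Illustrative group only: the four species named earlier in this comment.
CEREUS_GROUP = {
    "Bacillus_cereus",
    "Bacillus_thuringiensis",
    "Bacillus_anthracis",
    "Bacillus_mycoides",
}


def assign(percentages, group=CEREUS_GROUP, group_name="Bacillus_cereus_group",
           anchor="Bacillus_cereus", threshold=60.0):
    """Pick a label from a {species: percent of hits} table.

    A group member other than the anchor species is accepted as-is when it
    reaches the threshold; otherwise group members are summed into one label
    and the label with the highest total wins.
    """
    for species, pct in percentages.items():
        if species in group and species != anchor and pct >= threshold:
            return species
    collapsed = {}
    for species, pct in percentages.items():
        key = group_name if species in group else species
        collapsed[key] = collapsed.get(key, 0.0) + pct
    return max(collapsed, key=collapsed.get)
```

Applied to the top rows of the table above, the four group members are summed before being compared with Bacillus_sp.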
Another serious problem is that many samples sequenced from an isolated bacterium whose species is known to us still return dubious taxonomies, such as 50-50% taxA-taxB, with taxA being the correct one, and for those I just don't know what to do.

@torognes
Owner

I have made several improvements to the sintax command in vsearch 2.28.1, just released. Please see issue #535 or the release notes for details.
