non-CDS line in gff3 file read as CDS line, required to be divisible by 3 #221
Comments
Tagging @dpark01 as a watcher.
Hi @lakras, Thanks for reporting! I might need your help to better understand the issue. In the original GFF file the gene It's worth noting that there are many non-compliant and, honestly, simply sloppy genome annotations out there. For example, there are many GFF files in the databases which contain only genes or only CDSes and no gene-CDS pairs at all. In order to increase compatibility with various GFF files, when a gene has 0 CDS attached, Nextclade currently creates a "virtual" CDS for the entire span of the gene. The assumption is that each gene must have at least one CDS. Otherwise, how would it be translated? This may or may not be a correct assumption for all organisms. We don't have experts in HAV in the lab, so we need your help in handling this case correctly. Could you please tell us a little more about:
Thank you!
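The imputation described above can be sketched roughly as follows. This is a minimal illustration, not Nextclade's actual implementation: the `Feature` class and `impute_virtual_cds` function are hypothetical names, and the only values taken from this thread are the gene name `HAVgs1` and its span 1..7478.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical, stripped-down GFF3 feature (1-based inclusive coordinates)."""
    kind: str                     # "gene" or "CDS"
    id: str
    start: int
    end: int
    parent: Optional[str] = None  # ID of the parent feature, if any


def impute_virtual_cds(features: List[Feature]) -> List[Feature]:
    """Add one 'virtual' CDS spanning each gene that has no CDS attached."""
    genes_with_cds = {f.parent for f in features if f.kind == "CDS" and f.parent}
    result = list(features)
    for f in features:
        if f.kind == "gene" and f.id not in genes_with_cds:
            # Assumption: the virtual CDS simply copies the gene's name and span.
            result.append(Feature("CDS", f.id, f.start, f.end, parent=f.id))
    return result


# A gene with no CDS child, using the span reported in this issue.
annotated = impute_virtual_cds([Feature("gene", "HAVgs1", 1, 7478)])
virtual = [f for f in annotated if f.kind == "CDS"]
```

Under this assumption the virtual CDS inherits the gene's full length, so a gene whose span is not a multiple of 3 would fail any later divisibility check.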
@ivan-aksamentov -- first off, sorry this landed here, should we move this issue to the nextclade repo instead? It's true that GFFs and microbial gene annotations can be pretty all over the place in terms of how they represent things and I agree it's weird that this particular one has a That said, it seems ... like not the right behavior to automatically create "virtual" CDSes for genes that never specified any CDSes to begin with. Even if it had been erroneous for the author to omit CDSes, how would you guess where to place them within the gene? (and guessing that wrong is what caused the error in this case). Do you have any examples of other microbial gene annotations where To your question--I think nextclade should ignore any features it doesn't understand, and ignore any In terms of using nextclade for HAV, I think it's not hard to see how to edit the dataset files to make this work, but more broadly, it'd be nice if it worked for off-the-shelf annotations downloaded straight from Genbank that don't seem to have anything truly wrong with them (even if they are odd). Creating a virtual CDS for any
Thanks for the discussion here... For reasons of backward compatibility with (somewhat inconsistent) previous behavior, Nextclade treats genes without CDS as CDS... We had the ambition for it to work with GFFs pulled from GenBank, but ended up giving up on that. We do have some simple scripts that pull a GFF from GenBank and write out a minimal GFF that Nextclade should understand (with some choices whether you want a (hacky script, but might be useful).
You guys work at all hours huh? (I thought you were CEST or maybe time has no meaning) Well it's easy enough to modify this particular gff to work properly -- we'll probably just drop the "whole genome gene with no products" features and we'll hopefully PR our first dataset here soon. Still curious which old use cases warranted a phantom CDS. Does Nextclade work with EBOV GP/sGP/ssGP, which has three protein products off one gene, with all three sharing the first CDS, and each with a different second CDS (in three different frames, and one of them overlaps the first CDS by 1 bp... which definitely broke snpEff a decade ago)?
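One way to drop gene features that have no products is to remove every `gene` line whose `ID` is never referenced by another feature's `Parent` attribute. The sketch below is an assumption about how that could be done, not the script mentioned in this thread; `drop_childless_genes` is a hypothetical helper and the example records and coordinates are made up for illustration.

```python
def parse_attributes(column9: str) -> dict:
    """Parse the GFF3 attributes column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def drop_childless_genes(gff_text: str) -> str:
    """Return GFF3 text without 'gene' features that no other feature names as Parent."""
    lines = gff_text.splitlines()

    # First pass: collect every ID that is used as a Parent.
    referenced = set()
    for line in lines:
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        parent = parse_attributes(cols[8]).get("Parent")
        if parent:
            referenced.update(parent.split(","))  # Parent may list several IDs

    # Second pass: keep everything except genes nobody refers to.
    kept = []
    for line in lines:
        cols = line.split("\t")
        is_feature = not line.startswith("#") and len(cols) == 9
        if is_feature and cols[2] == "gene":
            if parse_attributes(cols[8]).get("ID") not in referenced:
                continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up example: gene-A has no children, gene-B has one CDS.
example = "\n".join([
    "##gff-version 3",
    "seq1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene-A",
    "seq1\tsrc\tgene\t10\t99\t.\t+\t.\tID=gene-B",
    "seq1\tsrc\tCDS\t10\t99\t.\t+\t0\tID=cds-B;Parent=gene-B",
])
cleaned = drop_childless_genes(example)
```

This keeps comment and directive lines untouched. It would also drop a gene whose only children are non-CDS features if those children carry no `Parent` attribute, so the output should be reviewed before use.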
Some of us might be temporarily in different time zones 😃 I suppose we could switch off CDS imputation via a flag, thereby accommodating more stock GFFs without breaking backwards compatibility. Yep, the EBOV situation is no problem at all for Nextclade. You just annotate 3 separate CDSes; Nextclade essentially ignores genes (except for using them as parents of CDSes, and apart from the imputation in case a gene doesn't have a CDS). I have an EBOV test dataset here: https://github.com/nextstrain/nextclade_data/blob/ebola/data/nextstrain/ebola/zaire/genome_annotation.gff3
Ok thanks for the discussion -- from what I'm seeing we can close this Issue; we know how to make HAV work; I'll let you guys ponder proper behavior or new options/flags for behavior but probably in the other github repo anyway.
Thanks all!
gene_changed.gff3.txt
original.gff3.txt
Hi, I'm having to change the gff3 file from NCBI in order to get NextClade to work. It is reading a non-CDS line as CDS.
The input is the gff3 file for NC_001489.1 (attached; changed .gff3 to .gff3.txt so it would upload).
I get the following error: "Length of a CDS is expected to be divisible by 3, but the length of CDS 'HAVgs1' is 7478 (it consists of 1 fragment(s) of length(s) 7478). This is likely a mistake in genome annotation."
In order to get it to work, I need to change "gene 1 7478" to "gene 1 7476" on line 7 (attached)
Thank you so much!
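The arithmetic behind the error and the workaround can be checked with a short sketch. GFF3 coordinates are 1-based and inclusive, so a feature spanning `start..end` has length `end - start + 1`; the helper names here are hypothetical.

```python
def feature_length(start: int, end: int) -> int:
    """Length of a GFF3 feature with 1-based, inclusive coordinates."""
    return end - start + 1


def leftover_bases(start: int, end: int) -> int:
    """Number of bases left over after splitting the span into whole codons."""
    return feature_length(start, end) % 3


original = leftover_bases(1, 7478)  # span from the NCBI file
edited = leftover_bases(1, 7476)    # span after the manual change
```

With the original span two bases are left over, which is why the length check fails; trimming the end coordinate to 7476 leaves none.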