non-CDS line in gff3 file read as CDS line, required to be divisible by 3 #221

Open
lakras opened this issue Aug 2, 2024 · 8 comments

Comments

@lakras

lakras commented Aug 2, 2024

gene_changed.gff3.txt
original.gff3.txt

Hi, I have to edit the gff3 file from NCBI in order to get Nextclade to work. It is reading a non-CDS line as a CDS line.

The input is the gff3 file for NC_001489.1 (attached; changed .gff3 to .gff3.txt so it would upload).

I get the following error: "Length of a CDS is expected to be divisible by 3, but the length of CDS 'HAVgs1' is 7478 (it consists of 1 fragment(s) of length(s) 7478). This is likely a mistake in genome annotation."

In order to get it to work, I need to change "gene 1 7478" to "gene 1 7476" on line 7 (attached)
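The arithmetic behind the error can be checked directly. GFF3 coordinates are 1-based and inclusive, so a feature spanning 1..7478 has length 7478, which is not a multiple of 3, while 1..7476 is. A minimal sketch (the helper name `feature_length` is made up for illustration and is not part of Nextclade):

```python
def feature_length(start: int, end: int) -> int:
    """Length of a GFF3 feature; coordinates are 1-based and inclusive."""
    return end - start + 1


original = feature_length(1, 7478)  # the "gene" line as downloaded
edited = feature_length(1, 7476)    # the manually edited line

# Print each length together with its remainder modulo 3.
print(original, original % 3)
print(edited, edited % 3)
```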

Thank you so much!

@lakras
Author

lakras commented Aug 2, 2024

Tagging @dpark01 as a watcher.

@ivan-aksamentov
Member

Hi @lakras,

Thanks for reporting! I might need your help to better understand the issue.

In the original GFF file the gene HAVgs1 does not have a corresponding CDS (the ID of the gene does not appear as a Parent of any CDS). It does have other features associated with it, an exon and a transcript, but Nextclade is currently not aware of these feature types and has no use for them. Currently Nextclade only understands a hierarchy of: gene -> CDS(es) -> peptide.

It's worth noting that there are many non-compliant and, honestly, simply sloppy genome annotations out there. For example, there are many GFF files in the databases which contain only genes or only CDSes and no gene-CDS pairs at all.

In order to increase compatibility with various GFF files, when a gene has 0 CDSes attached, Nextclade currently creates a "virtual" CDS for the entire span of the gene. The assumption is that each gene must have at least one CDS; otherwise, how would it be translated? This may or may not be a correct assumption for all organisms.
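The behavior described above can be sketched roughly as follows. This is only an illustration of the idea, not Nextclade's actual implementation; the names `Feature`, `parse_gff3` and `impute_virtual_cdses` are invented for the example, and the records are toy data (only the sequence ID, the gene name HAVgs1 and its 1..7478 span come from this thread).

```python
from dataclasses import dataclass


@dataclass
class Feature:
    kind: str    # GFF3 column 3, e.g. "gene" or "CDS"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    attrs: dict  # parsed column 9, e.g. {"ID": "...", "Parent": "..."}


def parse_gff3(text: str) -> list:
    """Parse tab-separated GFF3 feature lines, skipping comments and blanks."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        features.append(Feature(cols[2], int(cols[3]), int(cols[4]), attrs))
    return features


def impute_virtual_cdses(features: list) -> list:
    """For every gene that is not the Parent of any CDS, create one
    'virtual' CDS covering the whole span of that gene."""
    parents_with_cds = set()
    for f in features:
        if f.kind == "CDS" and "Parent" in f.attrs:
            # GFF3 allows several comma-separated parents.
            parents_with_cds.update(f.attrs["Parent"].split(","))
    return [
        Feature("CDS", f.start, f.end, {"Parent": f.attrs.get("ID")})
        for f in features
        if f.kind == "gene" and f.attrs.get("ID") not in parents_with_cds
    ]


# Toy annotation: one gene without a CDS child, one gene with a CDS child.
TOY_GFF3 = "\n".join("\t".join(cols) for cols in [
    ("NC_001489.1", "toy", "gene", "1", "7478", ".", "+", ".", "ID=HAVgs1"),
    ("NC_001489.1", "toy", "gene", "10", "99", ".", "+", ".", "ID=toyGene"),
    ("NC_001489.1", "toy", "CDS", "10", "99", ".", "+", "0", "ID=toyCds;Parent=toyGene"),
])

virtual = impute_virtual_cdses(parse_gff3(TOY_GFF3))
for cds in virtual:
    length = cds.end - cds.start + 1
    print(cds.attrs["Parent"], length, "divisible by 3:", length % 3 == 0)
```

Under these assumptions only HAVgs1 receives a virtual CDS, and its length of 7478 is what then fails the divisibility check.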

We don't have experts in HAV in the lab, so we need your help in handling this case correctly. Could you please tell us a little more about:

  • In your view, what exactly does this GFF file describe? Is it factually correct?

  • What is your expectation of how the gene HAVgs1 should be handled in Nextclade? In particular, what parts of the virus need translation? What is the role of the transcript feature here? Perhaps this is something Nextclade needs to deal with.

  • Is this something specific to HAV, or is it applicable more broadly?

Thank you!

@dpark01

dpark01 commented Aug 2, 2024

@ivan-aksamentov -- first off, sorry this landed here, should we move this issue to the nextclade repo instead?

It's true that GFFs and microbial gene annotations can be pretty all over the place in terms of how they represent things and I agree it's weird that this particular one has a gene that doesn't do much other than describe the entire genome... it seems a little pointless (it does have a child misc_RNA which is also just the entire genome.. both of these could probably stand to be deleted).

That said, it seems ... like not the right behavior to automatically create "virtual" CDSes for genes that never specified any CDSes to begin with. Even if it had been erroneous for the author to omit CDSes, how would you guess where to place them within the gene? (and guessing that wrong is what caused the error in this case). Do you have any examples of other microbial gene annotations where gene features are missing CDS children, but the correct/desired behavior is to create the CDSes anyway?

To your question--I think nextclade should ignore any features it doesn't understand, and ignore any gene features that lack CDSes. I wouldn't worry about transcript features or things like that.

In terms of using nextclade for HAV, I think it's not hard to see how to edit the dataset files to make this work, but more broadly, it'd be nice if it worked for off-the-shelf annotations downloaded straight from Genbank that don't seem to have anything truly wrong with them (even if they are odd). Creating a virtual CDS for any gene that lacks one seems like a solution that probably breaks more use cases than it helps?

@rneher
Member

rneher commented Aug 2, 2024

thanks for the discussion here...

For reasons of backward compatibility with (somewhat inconsistent) previous behavior, Nextclade treats a gene without a CDS as if it were a CDS...

We had the ambition for it to work with gffs pulled from genbank, but ended up giving up on that. We do have some simple scripts that pull a gff from genbank and write out a minimal gff that Nextclade should understand (with some choices, such as whether you want a mature_protein annotation or a CDS annotation). You can give this one a try:

https://github.com/nextstrain/nextclade_data/blob/master/docs/example-workflow/scripts/generate_from_genbank.py

(hacky script, but might be useful).

@dpark01

dpark01 commented Aug 3, 2024

You guys work at all hours huh? (I thought you were CEST or maybe time has no meaning)

Well it's easy enough to modify this particular gff to work properly -- we'll probably just drop the "whole genome gene with no products" features and we'll hopefully PR our first dataset here soon.
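One way to drop such features is sketched below, assuming a plain tab-separated GFF3 with `ID` and `Parent` attributes. The function name `drop_cdsless_genes` and the toy records are hypothetical; this is not the script linked above and not the edit actually applied to the HAV file. It removes every gene that has no CDS anywhere beneath it, together with that gene's children (for example a misc_RNA and its exon).

```python
def drop_cdsless_genes(lines):
    """Return GFF3 lines without genes that have no CDS descendant,
    and without the descendants of those genes."""
    records = []  # (line, feature type, ID, list of parent IDs)
    for line in lines:
        if line.startswith("#") or not line.strip():
            records.append((line, None, None, []))
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        records.append((line, cols[2], attrs.get("ID"), parents))

    parents_of = {fid: parents for _, _, fid, parents in records if fid}

    def ancestors(parents):
        """All IDs reachable by following Parent links upwards."""
        seen, stack = set(), list(parents)
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(parents_of.get(p, []))
        return seen

    # Every feature that has a CDS somewhere below it.
    coding = set()
    for _, kind, _, parents in records:
        if kind == "CDS":
            coding |= ancestors(parents)

    doomed = {fid for _, kind, fid, _ in records
              if kind == "gene" and fid and fid not in coding}

    return [line for line, _, fid, parents in records
            if fid not in doomed and not (ancestors(parents) & doomed)]


# Toy records (IDs and most coordinates are invented for the example).
TOY_LINES = ["##gff-version 3"] + ["\t".join(cols) for cols in [
    ("NC_001489.1", "toy", "gene", "1", "7478", ".", "+", ".", "ID=HAVgs1"),
    ("NC_001489.1", "toy", "misc_RNA", "1", "7478", ".", "+", ".", "ID=toyRna;Parent=HAVgs1"),
    ("NC_001489.1", "toy", "exon", "1", "7478", ".", "+", ".", "ID=toyExon;Parent=toyRna"),
    ("NC_001489.1", "toy", "gene", "10", "99", ".", "+", ".", "ID=toyGene"),
    ("NC_001489.1", "toy", "CDS", "10", "99", ".", "+", "0", "ID=toyCds;Parent=toyGene"),
]]

kept = drop_cdsless_genes(TOY_LINES)
print("\n".join(kept))
```

Features with neither an `ID` in the dropped set nor a dropped ancestor, such as header lines, pass through unchanged.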

Still curious which old use cases warranted a phantom CDS.

Does Nextclade work with EBOV GP/sGP/ssGP, which has three protein products off one gene, with all three sharing the first CDS, and each with a different second CDS (in three different frames, and one of them overlaps the first CDS by 1bp.. which definitely broke snpEff a decade ago)?

@corneliusroemer
Member

Some of us might be temporarily in different time zones 😃

I suppose we could switch off cds-imputation via a flag, thereby accommodating more stock gffs without breaking backwards compatibility.

Yep, the EBOV situation is no problem at all for Nextclade. You just annotate 3 separate CDSes. Nextclade essentially ignores genes (except for using them as parents of CDSes, and apart from the imputation in case a gene doesn't have a CDS).

I have an EBOV test dataset here: https://github.com/nextstrain/nextclade_data/blob/ebola/data/nextstrain/ebola/zaire/genome_annotation.gff3
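To illustrate the "3 separate CDSes" idea: each product is its own CDS made of two fragments, all three share the first fragment, and the divisibility check applies to the summed fragment lengths of each CDS independently. The coordinates and names below are hypothetical, chosen only so the example runs. They are NOT the real EBOV annotation, and the check is a sketch rather than Nextclade's code.

```python
# Hypothetical coordinates (1-based, inclusive), not the real EBOV annotation.
SHARED = (101, 400)
CDSES = {
    "product_A": [SHARED, (400, 699)],  # second fragment overlaps the first by 1 nt
    "product_B": [SHARED, (401, 550)],  # second fragment starts one frame later
    "product_C": [SHARED, (402, 461)],  # and one frame later again
}


def cds_length(fragments):
    """Total length of a CDS: the sum of its fragment lengths."""
    return sum(end - start + 1 for start, end in fragments)


def check_cds(name, fragments):
    """Raise if the summed fragment length is not a multiple of 3."""
    total = cds_length(fragments)
    if total % 3 != 0:
        raise ValueError(f"CDS '{name}' has length {total}, not divisible by 3")
    return total


lengths = {name: check_cds(name, frags) for name, frags in CDSES.items()}
print(lengths)
```

Because every CDS is checked on its own, overlapping or frame-shifted second fragments do not interfere with each other, whereas a single fragment such as 1..7478 would fail the same check.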

@dpark01

dpark01 commented Aug 3, 2024

Ok thanks for the discussion -- from what I'm seeing we can close this Issue; we know how to make HAV work; I'll let you guys ponder proper behavior or new options/flags for behavior but probably in the other github repo anyway.

@lakras
Author

lakras commented Aug 6, 2024

Thanks all!
