non-CDS line in gff3 file read as CDS line, required to be divisible by 3 #221

lakras · 2024-08-02T20:22:09Z

Hi, I'm having to change the gff3 file from NCBI in order to get NextClade to work. It is reading a non-CDS line as CDS.

The input is the gff3 file for NC_001489.1 (attached; changed .gff3 to .gff3.txt so it would upload).

I get the following error: "Length of a CDS is expected to be divisible by 3, but the length of CDS 'HAVgs1' is 7478 (it consists of 1 fragment(s) of length(s) 7478). This is likely a mistake in genome annotation."

In order to get it to work, I need to change "gene 1 7478" to "gene 1 7476" on line 7 (attached)
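The arithmetic behind the error can be checked with a minimal sketch. The function name is made up for illustration, and the only assumption is that GFF3 coordinates are 1-based and inclusive, so a feature spanning 1..7478 has length 7478:

```python
def length_remainder(start: int, end: int) -> int:
    """Return (feature length) mod 3 for 1-based, inclusive GFF3 coordinates."""
    length = end - start + 1
    return length % 3


# The span from the error message: 7478 = 3 * 2492 + 2, so it is not a whole number of codons.
original = length_remainder(1, 7478)
# The manually edited span: 7476 = 3 * 2492.
edited = length_remainder(1, 7476)
print(original, edited)
```

A nonzero remainder is what triggers the "expected to be divisible by 3" message once the span is treated as a CDS.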

Thank you so much!

lakras · 2024-08-02T20:23:01Z

Tagging @dpark01 as a watcher.

ivan-aksamentov · 2024-08-02T21:08:15Z

Hi @lakras,

Thanks for reporting! I might need your help to better understand the issue.

In the original GFF file, the gene HAVgs1 does not have a corresponding CDS (the ID of the gene does not appear as a Parent of any CDS). It does have other features associated with it, an exon and a transcript, but Nextclade is not currently aware of these feature types and has no use for them. Currently Nextclade only understands a hierarchy of: gene -> CDS(es) -> peptide.

It's worth noting that there are many non-compliant and, honestly, simply sloppy genome annotations out there. For example, there are many GFF files in the databases which contain only genes or only CDSes and no gene-CDS pairs at all.

In order to increase compatibility with various GFF files, when a gene has 0 CDSes attached, Nextclade currently creates a "virtual" CDS for the entire span of the gene. The assumption is that each gene must have at least one CDS. Otherwise, how would it be translated? This may or may not be a correct assumption for all organisms.
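To make the described behavior concrete, here is a rough Python sketch of that kind of imputation. It is not Nextclade's actual code. The function names are hypothetical, the example annotation is invented (only the HAVgs1 span 1..7478 comes from the error message above), and it only looks at direct Parent links from CDS lines to genes:

```python
def parse_attributes(column9):
    """Parse the GFF3 attributes column ("key=value;key=value") into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def impute_virtual_cdses(gff_text):
    """Return (gene_id, start, end, length) for every gene that no CDS names as Parent."""
    genes = {}
    cds_parents = set()
    for line in gff_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip directives, comments and malformed lines
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and "ID" in attrs:
            genes[attrs["ID"]] = (int(cols[3]), int(cols[4]))
        elif cols[2] == "CDS":
            cds_parents.update(attrs.get("Parent", "").split(","))
    # A gene with zero CDS children gets one "virtual" CDS covering its whole span.
    return [(gid, s, e, e - s + 1) for gid, (s, e) in genes.items() if gid not in cds_parents]


# Invented annotation: one whole-genome gene without a CDS, one gene with a CDS.
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "\t".join(["NC_001489.1", "RefSeq", "gene", "1", "7478", ".", "+", ".", "ID=HAVgs1"]),
    "\t".join(["NC_001489.1", "RefSeq", "transcript", "1", "7478", ".", "+", ".", "ID=rna0;Parent=HAVgs1"]),
    "\t".join(["NC_001489.1", "RefSeq", "gene", "10", "99", ".", "+", ".", "ID=geneB"]),
    "\t".join(["NC_001489.1", "RefSeq", "CDS", "10", "99", ".", "+", "0", "ID=cdsB;Parent=geneB"]),
])

virtual = impute_virtual_cdses(EXAMPLE_GFF)
for gene_id, start, end, length in virtual:
    status = "ok" if length % 3 == 0 else "not divisible by 3"
    print(f"virtual CDS for {gene_id}: {start}..{end}, length {length} ({status})")
```

In this sketch the whole-genome gene is the only one that receives a virtual CDS, and its length of 7478 then fails the divisibility check, which matches the reported error.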

We don't have experts in HAV in the lab, so we need your help in handling this case correctly. Could you please tell us a little more about:

In your view, what exactly does this GFF file describe? Is it factually correct?
What is your expectation of how the gene HAVgs1 should be handled in Nextclade? In particular, what parts of the virus need translation? What is the role of the transcript feature here? Perhaps this is something Nextclade needs to deal with.
Is this something specific for HAV, or is applicable more broadly?

Thank you!

dpark01 · 2024-08-02T22:36:46Z

@ivan-aksamentov -- first off, sorry this landed here, should we move this issue to the nextclade repo instead?

It's true that GFFs and microbial gene annotations can be pretty all over the place in terms of how they represent things and I agree it's weird that this particular one has a gene that doesn't do much other than describe the entire genome... it seems a little pointless (it does have a child misc_RNA which is also just the entire genome.. both of these could probably stand to be deleted).

That said, it seems ... like not the right behavior to automatically create "virtual" CDSes for genes that never specified any CDSes to begin with. Even if it had been erroneous for the author to omit CDSes, how would you guess where to place them within the gene? (and guessing that wrong is what caused the error in this case). Do you have any examples of other microbial gene annotations where gene features are missing CDS children, but the correct/desired behavior is to create the CDSes anyway?

To your question--I think nextclade should ignore any features it doesn't understand, and ignore any gene features that lack CDSes. I wouldn't worry about transcript features or things like that.

In terms of using nextclade for HAV, I think it's not hard to see how to edit the dataset files to make this work, but more broadly, it'd be nice if it worked for off-the-shelf annotations downloaded straight from Genbank that don't seem to have anything truly wrong with them (even if they are odd). Creating a virtual CDS for any gene that lacks one seems like a solution that probably breaks more use cases than it helps?

rneher · 2024-08-02T23:26:24Z

thanks for the discussion here...

For reasons of backward compatibility with (somewhat inconsistent) previous behavior, NextClade treats genes without CDS as CDS...

We had the ambition for it to work with gffs pulled from genbank, but ended up giving up on that. We do have some simple scripts that pull a gff from genbank and write out a minimal gff that nextclade should understand (with some choices whether you want a mature_protein annotation or a CDS annotation etc.). You can give this one a try:

https://github.com/nextstrain/nextclade_data/blob/master/docs/example-workflow/scripts/generate_from_genbank.py

(hacky script, but might be useful).

dpark01 · 2024-08-03T12:11:05Z

You guys work at all hours huh? (I thought you were CEST or maybe time has no meaning)

Well it's easy enough to modify this particular gff to work properly -- we'll probably just drop the "whole genome gene with no products" features and we'll hopefully PR our first dataset here soon.
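For anyone wanting to script that cleanup rather than hand-edit the file, a minimal Python sketch could look like the following. The function names and the example lines are made up for illustration (only the HAVgs1 span comes from this thread). It drops every gene that has no CDS anywhere beneath it, together with that gene's descendants, and leaves everything else untouched:

```python
def parse_attrs(col9):
    """Parse the GFF3 attributes column into a dict."""
    return dict(p.split("=", 1) for p in col9.strip().split(";") if "=" in p)


def ancestors(feature_ids, parent_map):
    """Return the given IDs plus everything reachable by following Parent links."""
    seen, stack = set(), list(feature_ids)
    while stack:
        fid = stack.pop()
        if fid not in seen:
            seen.add(fid)
            stack.extend(parent_map.get(fid, []))
    return seen


def drop_genes_without_cds(lines):
    """Remove genes that have no CDS descendant, along with their child features."""
    features, parent_map = [], {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            features.append((line, None, None, []))  # keep directives and comments
            continue
        attrs = parse_attrs(cols[8])
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        features.append((line, cols[2], attrs.get("ID"), parents))
        if "ID" in attrs:
            parent_map.setdefault(attrs["ID"], []).extend(parents)

    # Everything above a CDS (direct parent, grandparent, ...) must be kept.
    cds_parents = [p for _, ftype, _, parents in features if ftype == "CDS" for p in parents]
    needed = ancestors(cds_parents, parent_map)
    doomed = {fid for _, ftype, fid, _ in features if ftype == "gene" and fid and fid not in needed}

    kept = []
    for line, ftype, fid, parents in features:
        lineage = ancestors(([fid] if fid else []) + parents, parent_map)
        if ftype is not None and lineage & doomed:
            continue  # a CDS-less gene or one of its descendants
        kept.append(line)
    return kept


# Invented example: the whole-genome gene has only a transcript and an exon beneath it.
EXAMPLE_LINES = [
    "##gff-version 3",
    "\t".join(["NC_001489.1", "RefSeq", "gene", "1", "7478", ".", "+", ".", "ID=HAVgs1"]),
    "\t".join(["NC_001489.1", "RefSeq", "transcript", "1", "7478", ".", "+", ".", "ID=rna0;Parent=HAVgs1"]),
    "\t".join(["NC_001489.1", "RefSeq", "exon", "1", "7478", ".", "+", ".", "ID=exon0;Parent=rna0"]),
    "\t".join(["NC_001489.1", "RefSeq", "gene", "10", "99", ".", "+", ".", "ID=geneB"]),
    "\t".join(["NC_001489.1", "RefSeq", "CDS", "10", "99", ".", "+", "0", "ID=cdsB;Parent=geneB"]),
]

cleaned = drop_genes_without_cds(EXAMPLE_LINES)
print("\n".join(cleaned))
```

Following Parent links transitively means a gene whose CDS hangs off an intermediate feature such as an mRNA would still be kept. Only genes with no CDS at any depth are removed.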

Still curious which old use cases warranted a phantom CDS.

Does Nextclade work with EBOV GP/sGP/ssGP, which has three protein products off one gene, with all three sharing the first CDS, and each with a different second CDS (in three different frames, and one of them overlaps the first CDS by 1bp.. which definitely broke snpEff a decade ago)?

corneliusroemer · 2024-08-03T13:00:09Z

Some of us might be temporarily in different time zones 😃

I suppose we could switch off cds-imputation via a flag, thereby accommodating more stock gffs without breaking backwards compatibility.

Yep, the EBOV situation is no problem at all for Nextclade. You just annotated 3 separate CDSes, Nextclade essentially ignores genes (except for using them as parents of CDSes, and apart from the imputation in case a gene doesn't have a CDS).

I have an EBOV test dataset here: https://github.com/nextstrain/nextclade_data/blob/ebola/data/nextstrain/ebola/zaire/genome_annotation.gff3

dpark01 · 2024-08-03T13:11:08Z

Ok thanks for the discussion -- from what I'm seeing we can close this Issue; we know how to make HAV work; I'll let you guys ponder proper behavior or new options/flags for behavior but probably in the other github repo anyway.

lakras · 2024-08-06T22:41:58Z

Thanks all!
