Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No ORFs for 38kb phage with 33 Prodigal genes #8

Closed
snayfach opened this issue May 17, 2019 · 17 comments
Closed

No ORFs for 38kb phage with 33 Prodigal genes #8

snayfach opened this issue May 17, 2019 · 17 comments

Comments

@snayfach
Copy link

I've now run phanotate on a large number of phage sequences. Its seems to be producing output in most cases. However for a small subset of sequences, no ORFs are found. These sequences are often >20Kb and contain numerous genes called by Prodigal.

Here is one example you can download and test yourself:
https://www.dropbox.com/s/xxpagf1qvcj6f57/test.fna?dl=0

In this example, the sequence in a 38kb phage with 33 Prodigal genes

Any ideas what might be going on here?

@deprekate
Copy link
Owner

Hm, I am trying to track this down, it appears to be something wrong in fastpathz, which returns 'no path to target'. I have tracked it down to potentially the extremely long ORFs that have huge negative weights in scientific notation (ie -1.2E+40).

@snayfach
Copy link
Author

Any additional progress on this?

@deprekate
Copy link
Owner

Yep, I tracked down the cause of the problem. It is an error with the fastpathz Bellman-Ford algorithm not correctly relaxing the edges. I will have a fix as soon as possible, at least by early next week, if not over the weekend.

@deprekate
Copy link
Owner

I wanted to update you on this bug. I tracked the original bug down and fixed it. It was due to GMP not having a notion of INFINITY, so I create an 'INFINITY' by setting an arbitrarily large number, which it turned out was not large enough. I made it even larger (1E1000) and it fixed the bug.

However to test whether this fix would affect previous runs, I ran the new code on ~4,000 phage genomes. All but one were 100% identical, with the one different run choosing a different start codon. Depending on random chance, the Bellman-Ford will return two different start codons in different runs. This is wrong, since the Bellman-Ford should always return the same path, if the paths are not equal.

I have been trying to track it down, I believe it may either be something to do with the GMP math library, or more likely a collision in the UTHASH dictionary implementation. I will continue to work on it in the coming weeks.

Cheers

@deprekate
Copy link
Owner

I tracked down the bug. It turns out it was my fault, I did not correctly null terminate a character array in fastpathz. I have fixed the bug, and pushed updates out.

@snayfach
Copy link
Author

I'm still finding cases where PHANNOTATE predicts no genes. Here's one example of a 14.8 Kb contig with 24 Prodigal genes:
https://www.dropbox.com/s/r5792v92kqqhlcu/test3.fna

@deprekate
Copy link
Owner

Hm, I can't seem to recreate the error, I have tried on OSX and Linux (CentOS), with pypy/python2.7/python3, and with/without tRNAscanSE. I also ran it 10,000 times in case it was a random occurence type bug.

What platform are you on?

It might be that the updated fastpathz was not pulled? I know github does not handle included submodules as seamless as they should (the reason for the --recursive flag in the git clone). You can see if manually pulling the fastpast submodule fixes the error:

git submodule update --recursive

@snayfach
Copy link
Author

You're right, it was my mistake this time. I failed to re-install the software properly

@snayfach
Copy link
Author

Can you check one more sequence for me:
https://www.dropbox.com/s/a91h60wyqb2yzc8/test.fna

I'm running 1.2.1 but still getting the same error. I'm not sure if this is an error on my end (failed to install latest code) or not.

@deprekate
Copy link
Owner

It looks like this one does error with the "no path to target" from fastpathz. I'll get a track it down and get a fix pushed out.

@deprekate
Copy link
Owner

Hi, sorry for the delay. I have tracked down the problem, and it is because there is a region in the sequence that is "essentially" non-coding and greater than the 300bp connection cutoff. The reason I say "essentially" is that the region is actually spanned by a spurious ORF (in the +3 frame), however this spurious ORF does not get added to the path (in red) because it overlaps by more than 300bp with the next ORF that would be added to the path (the right red ORF).
nopath_region

I could tell right away that the middle red ORF should actually extend to the right, I figured whatever this genome was it used a non-traditional start codon somewhere around 20100. In actuality I realized that this genome has TAG stop codon read through. and that the three red ORFs are most likely one single giant ORF.

I am trying to figure out the best way for PHANOTATE to handle such a situation.

@ilyavs
Copy link

ilyavs commented Nov 12, 2022

Hi,
@deprekate any news on this? I encountered a similar error from fastpathz.
Any way to skip the problematic ORF and still get the output for the rest of the genes?
Thanks,
Ilya.

@deprekate
Copy link
Owner

@ilyavs,

I have not worked on this bug, since it is caused by using the wrong translation table (ie stop codon readthrough). The develop branch of phanotate allows for specifying which stop codons (and start codons) to use.

I also just updated the develop branch to allow for up to 500bp overlaps, which might help force through path prediction.

@ilyavs
Copy link

ilyavs commented Dec 16, 2022

Hi, thank you for your response. I just ran the develop branch on the same fasta file without changing any input arguments (so no changes in the translation table) and it worked without errors.
Would you recommend using the develop branch for routine analysis? Is it considered stable?
Thanks,
Ilya.

@deprekate
Copy link
Owner

The develop branch has many performance increases since I moved some of the slower python code into C code (because people were using phanotate to annotate bacteria, and it was taking longer). But to do this I had to simplify the weight calculations, which makes the output not 100% identical to the version 1.

I think there was one instability bug that I couldn't quite track down #26, but it was mostly caused by merging multiple phage contigs into one big file.

When I was testing the accuracy of the 2.0 develop branch to the 1.0 version (on 4k phages), it didn't perform significantly better or worse, just those different 0.1% ORF calls of mostly lower scoring "questionable" ORFS, so I never pushed the 2.0 version to the main branch. I would consider it mostly stable, more of a nightly build than a production build.

There are also a few more options, like being able to specify the start and stop codons.

@ilyavs
Copy link

ilyavs commented Dec 16, 2022

Thank you. Do you have an estimate, when or if, the develop branch will make it into a release?

@deprekate
Copy link
Owner

Probably not for another 6 months. I have been really busy on two other computational biology software projects: PRFect and Genotate.

PRFect is my new tool that predicts Programmed Ribosomal Frameshifts in the coding genes of a genome, its in review now

Genotate isn't public just yet, is another gene finder, except it runs on all domains of life. It is totally different from Phanotate and has some really amazing features as it doesn't use start or stop codons the way all other gene finders (GeneMark, Glimmer, Prodigal, Phanotate) rely upon. Though, because Genotate is supervised learning, Phanotate will still be useful for identifying novel and unknown genes in diverse genomes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants