-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No ORFs for 38kb phage with 33 Prodigal genes #8
Comments
Hm, I am trying to track this down, it appears to be something wrong in fastpathz, which returns 'no path to target'. I have tracked it down to potentially the extremely long ORFs that have huge negative weights in scientific notation (ie -1.2E+40). |
Any additional progress on this? |
Yep, I tracked down the cause of the problem. It is an error with the fastpathz Bellman-Ford algorithm not correctly relaxing the edges. I will have a fix as soon as possible, at least by early next week, if not over the weekend. |
I wanted to update you on this bug. I tracked the original bug down and fixed it. It was due to GMP not having a notion of INFINITY, so I create an 'INFINITY' by setting an arbitrarily large number, which it turned out was not large enough. I made it even larger (1E1000) and it fixed the bug. However to test whether this fix would affect previous runs, I ran the new code on ~4,000 phage genomes. All but one were 100% identical, with the one different run choosing a different start codon. Depending on random chance, the Bellman-Ford will return two different start codons in different runs. This is wrong, since the Bellman-Ford should always return the same path, if the paths are not equal. I have been trying to track it down, I believe it may either be something to do with the GMP math library, or more likely a collision in the UTHASH dictionary implementation. I will continue to work on it in the coming weeks. Cheers |
I tracked down the bug. It turns out it was my fault, I did not correctly null terminate a character array in fastpathz. I have fixed the bug, and pushed updates out. |
I'm still finding cases where PHANNOTATE predicts no genes. Here's one example of a 14.8 Kb contig with 24 Prodigal genes: |
Hm, I can't seem to recreate the error, I have tried on OSX and Linux (CentOS), with pypy/python2.7/python3, and with/without tRNAscanSE. I also ran it 10,000 times in case it was a random occurence type bug. What platform are you on? It might be that the updated fastpathz was not pulled? I know github does not handle included submodules as seamless as they should (the reason for the
|
You're right, it was my mistake this time. I failed to re-install the software properly |
Can you check one more sequence for me: I'm running 1.2.1 but still getting the same error. I'm not sure if this is an error on my end (failed to install latest code) or not. |
It looks like this one does error with the "no path to target" from fastpathz. I'll get a track it down and get a fix pushed out. |
Hi, |
I have not worked on this bug, since it is caused by using the wrong translation table (ie stop codon readthrough). The develop branch of phanotate allows for specifying which stop codons (and start codons) to use. I also just updated the develop branch to allow for up to 500bp overlaps, which might help force through path prediction. |
Hi, thank you for your response. I just ran the develop branch on the same fasta file without changing any input arguments (so no changes in the translation table) and it worked without errors. |
The develop branch has many performance increases since I moved some of the slower python code into C code (because people were using phanotate to annotate bacteria, and it was taking longer). But to do this I had to simplify the weight calculations, which makes the output not 100% identical to the version 1. I think there was one instability bug that I couldn't quite track down #26, but it was mostly caused by merging multiple phage contigs into one big file. When I was testing the accuracy of the 2.0 develop branch to the 1.0 version (on 4k phages), it didn't perform significantly better or worse, just those different 0.1% ORF calls of mostly lower scoring "questionable" ORFS, so I never pushed the 2.0 version to the main branch. I would consider it mostly stable, more of a nightly build than a production build. There are also a few more options, like being able to specify the start and stop codons. |
Thank you. Do you have an estimate, when or if, the develop branch will make it into a release? |
Probably not for another 6 months. I have been really busy on two other computational biology software projects: PRFect and Genotate. PRFect is my new tool that predicts Programmed Ribosomal Frameshifts in the coding genes of a genome, its in review now Genotate isn't public just yet, is another gene finder, except it runs on all domains of life. It is totally different from Phanotate and has some really amazing features as it doesn't use start or stop codons the way all other gene finders (GeneMark, Glimmer, Prodigal, Phanotate) rely upon. Though, because Genotate is supervised learning, Phanotate will still be useful for identifying novel and unknown genes in diverse genomes. |
I've now run phanotate on a large number of phage sequences. Its seems to be producing output in most cases. However for a small subset of sequences, no ORFs are found. These sequences are often >20Kb and contain numerous genes called by Prodigal.
Here is one example you can download and test yourself:
https://www.dropbox.com/s/xxpagf1qvcj6f57/test.fna?dl=0
In this example, the sequence in a 38kb phage with 33 Prodigal genes
Any ideas what might be going on here?
The text was updated successfully, but these errors were encountered: