Invalid output when provided multiple contigs #42

Open
olsonanl opened this issue Nov 22, 2024 · 5 comments

@olsonanl

Hi - we had a user submit a genome for annotation with phanotate that had a couple of tiny assembly-artifact contigs at the end of the file:

>NODE_2_length_113_cov_32.879310
TTGGGGCCCACACCCCAACTTGCATTGCCTGTAGAATTTCTTTTCGAAATTCTCTTTGTT
GGGGCCCACACCCCAACTTGCATTGCCTGTAGAATTTCTTTTCGAAATTCTCT
>NODE_3_length_56_cov_60604.000000
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Running phanotate on this input printed an error and produced calls on the small contig whose coordinates fall outside the range of locations in the data:

Apptainer> phanotate P2SA480_Pro_926.contigs.fasta  > p2.out
Traceback (most recent call last):
  File "/opt/patric-common/runtime/bin/phanotate", line 63, in <module>
    shortest_path = fz.get_path(source=source, target=target)
ValueError: Source node not found

Apptainer> grep -A 10 'id.*NODE_2' p2.out
#id:	NODE_2_length_113_cov_32.879310
#START	STOP	FRAME	CONTIG	SCORE
1217	375	-	NODE_2_length_113_cov_32.879310	-2.903608E+09
2755	1343	-	NODE_2_length_113_cov_32.879310	-3.456958E+09
3017	2742	-	NODE_2_length_113_cov_32.879310	-7.640294E+01
3468	3073	-	NODE_2_length_113_cov_32.879310	-5.967551E+02
4712	3474	-	NODE_2_length_113_cov_32.879310	-4.772831E+12
6623	4725	-	NODE_2_length_113_cov_32.879310	-4.642973E+13
7059	6760	-	NODE_2_length_113_cov_32.879310	-3.205461E+02
7272	7099	-	NODE_2_length_113_cov_32.879310	-1.295295E+01
7653	7276	-	NODE_2_length_113_cov_32.879310	-8.860170E+02
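
A downstream sanity check along the following lines would flag calls like these. It is a minimal sketch, not part of phanotate: `fasta_lengths` and `out_of_range_calls` are hypothetical helper names, and it assumes the tab-separated START/STOP/FRAME/CONTIG/SCORE layout shown above with plain integer coordinates.

```python
def fasta_lengths(fasta_text):
    """Return {contig_id: sequence_length} from FASTA text."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the contig id.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def out_of_range_calls(output_text, lengths):
    """Return (start, stop, contig) for calls lying outside their contig."""
    bad = []
    for line in output_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' comment/header lines
        fields = line.split("\t")
        start, stop, contig = int(fields[0]), int(fields[1]), fields[3]
        limit = lengths.get(contig)  # None if the contig id is unknown
        if limit is None or min(start, stop) < 1 or max(start, stop) > limit:
            bad.append((start, stop, contig))
    return bad
```

With the 113-base NODE_2 contig above, every row in the grep output would be reported, since coordinates such as 1217 and 375 exceed the contig length.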

Thanks,
Bob

@deprekate
Owner

Ah, let me code a catch to gracefully skip such sequences. Oddly, I believe the problem isn't that the sequence is bad, but that it's so short (<80) that the "ORF" at 1..56 doesn't get added to the digraph.
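
A catch of this kind could look roughly like the sketch below. It is not phanotate's actual code: `predict_all` and `find_path` are hypothetical stand-ins, and the only detail taken from the traceback above is that the path search raises `ValueError` when the source node is missing from the graph.

```python
import sys


def predict_all(contigs, find_path):
    """Run `find_path` on each (contig_id, sequence) pair.

    `find_path` is a hypothetical callable that returns a list of calls
    and raises ValueError when no path exists (for example, when the
    source node was never added to the graph). Contigs that fail are
    reported with an empty list instead of aborting the whole run.
    """
    results = {}
    for contig_id, sequence in contigs:
        try:
            results[contig_id] = find_path(sequence)
        except ValueError as err:
            # Warn on stderr so the tabular output on stdout stays clean.
            print(f"skipping {contig_id}: {err}", file=sys.stderr)
            results[contig_id] = []
    return results
```

Because each contig gets a freshly assigned result, a contig that fails in this sketch ends up with an empty list rather than anything computed for an earlier contig.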

@olsonanl
Author

Excellent - thank you

@deprekate
Owner

deprekate commented Nov 26, 2024

I implemented a fix, but before I finalize it, what would be the preferred format for the tabular output?

Would four consecutive comment lines cause issues, given that the first (top) ones might get mistaken as belonging to the 111..1 ORF?

$ cat test.fasta 
>NODE_3_length_56_cov_60604.000000
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>NODE_2_length_113_cov_32.879310
TTGGGGCCCACACCCCAACTTGCATTGCCTGTAGAATTTCTTTTCGAAATTCTCTTTGTT
GGGGCCCACACCCCAACTTGCATTGCCTGTAGAATTTCTTTTCGAAATTCTCT

$ phanotate.py test.fasta 
#id:	NODE_3_length_56_cov_60604.000000
#START	STOP	FRAME	CONTIG	SCORE
#id:	NODE_2_length_113_cov_32.879310
#START	STOP	FRAME	CONTIG	SCORE
111	1	-	NODE_2_length_113_cov_32.879310	-2.785323E+00

Or should I entirely omit from the output any sequences that do not have any ORFs predicted?

@olsonanl
Author

I think either headers with no data or skipping headers entirely is reasonable. I don't think my parsing code looks at the headers.
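
A parser along the following lines is indifferent to both choices. This is only a sketch under assumptions: `parse_calls` is a hypothetical name, not the BV-BRC parsing code, and it assumes tab-separated rows with integer coordinates where `#` starts a comment line, as in the sample output above.

```python
from collections import defaultdict


def parse_calls(output_text):
    """Group data rows by the CONTIG column, ignoring all '#' lines."""
    calls = defaultdict(list)
    for line in output_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # headers carry no information this parser needs
        start, stop, frame, contig, score = line.split("\t")
        calls[contig].append((int(start), int(stop), frame, float(score)))
    return dict(calls)
```

Since the contig id is read from each data row, a contig with no predicted ORFs is simply absent from the result whether or not its header lines are printed.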

@deprekate
Owner

I just pushed the update to both git and pypi.

On a separate topic, I released a programmed ribosomal frameshift predictor, which might be of interest to the BV-BRC.

I also have a brand new gene finder (Genotate) that is more accurate than GeneMark/Glimmer/Prodigal/phanotate, but it is still in a quasi-alpha state. It can detect genes with non-canonical start codons, partial genes (such as PRF fragments and inteins), stop codon read-through, and completely overlapped nested genes (it correctly predicts 9 of the 10 genes of phiX174).
