Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add virAnnot tool #5482

Merged
merged 100 commits into from
Mar 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
100 commits
Select commit Hold shift + click to select a range
153a085
First commit
marieBvr Sep 13, 2023
b0040a0
Flake8 cleaning
marieBvr Sep 13, 2023
3b50a55
Lint R and python files
marieBvr Sep 13, 2023
949247e
Add remote_repository_url
marieBvr Sep 13, 2023
a78a7ac
Fix python lint which I can't reproduce
marieBvr Sep 13, 2023
3c1d12b
Update tools/virAnnot/virAnnot_blast2tsv.xml
marieBvr Sep 18, 2023
01264d1
Update tools/virAnnot/virAnnot_blast2tsv.xml
marieBvr Sep 18, 2023
e585569
Update tools/virAnnot/virAnnot_otu.xml
marieBvr Sep 18, 2023
d9a1bd5
Change xml to blastxml format
marieBvr Sep 18, 2023
75c77bc
Update tools/virAnnot/virAnnot_otu.xml
marieBvr Sep 18, 2023
e329889
Add suite info
marieBvr Sep 18, 2023
c5fc2ee
Simplify citation
marieBvr Sep 18, 2023
428ac7b
Remove unnecessary files
marieBvr Sep 18, 2023
9b00a0f
Fix .shed synthaxe
marieBvr Sep 18, 2023
4dbab3b
Fix .shed synthaxe
marieBvr Sep 19, 2023
d38fdca
reduce size of test data
marieBvr Sep 19, 2023
f5b095a
Fix first test
marieBvr Sep 19, 2023
630b19d
Update tools/virAnnot/virAnnot_tsv2krona.xml
marieBvr Sep 20, 2023
ebfbbb6
Add useful help info
marieBvr Sep 20, 2023
2d2b9d6
Add useful help info for blast2tsv
marieBvr Sep 20, 2023
fbc58ee
Add useful help info for rps2tsv
marieBvr Sep 20, 2023
1d93ac8
Add useful help info for tsv2krona
marieBvr Sep 20, 2023
2de1eef
Fix missing single quote
marieBvr Sep 20, 2023
8592efd
Remove single quote from for loop
marieBvr Sep 20, 2023
f3beb5c
Change purpose of script: generate abundance files to be used by exis…
marieBvr Oct 3, 2023
4eb0576
Remove krona test unnecessary data
marieBvr Oct 3, 2023
c5dd661
Update rps2tsv to use Interpro instead of pfam because pfam wbesite i…
marieBvr Oct 25, 2023
e609d17
Adapt blast2tsv XML to abundance files
marieBvr Oct 26, 2023
3a2ccb2
Update blast2tsv test data
marieBvr Oct 26, 2023
42bfc64
Remove blast2tsv old test data
marieBvr Oct 26, 2023
4f9946b
Fix flake8 messages which I can't reproduce
marieBvr Oct 26, 2023
da4a65b
Fix tool output definition equal to input name
marieBvr Nov 7, 2023
474f609
Fix tool output name
marieBvr Nov 8, 2023
0df60b8
Update data for tests
marieBvr Nov 8, 2023
18f8455
Change input select to integer
marieBvr Nov 8, 2023
dcd567b
Fix typo and quotes
marieBvr Nov 8, 2023
eaf6a23
Fix fasta sequence KeyError
marieBvr Nov 8, 2023
fe58fc8
Fix renaming of data test files for test
marieBvr Nov 8, 2023
6deed67
Specific requirements does not work
marieBvr Jan 29, 2024
b6ff2ee
Fix requirements call
marieBvr Jan 29, 2024
a23f8d4
this combination should work
bgruening Jan 29, 2024
89402ba
Add missing requirement openpyxl
marieBvr Jan 29, 2024
089c19c
Try fixing lintr error messages
marieBvr Feb 14, 2024
2f2724d
Try fixing lintr error messages indent
marieBvr Feb 14, 2024
6c4532d
Update test data for rps2tsv
marieBvr Feb 16, 2024
8886f1a
Try fixing R indentation
marieBvr Feb 16, 2024
b225c34
Try fixing R lint
marieBvr Feb 16, 2024
e41cb86
First commit
marieBvr Sep 13, 2023
c38d8a5
Flake8 cleaning
marieBvr Sep 13, 2023
761f6b0
Lint R and python files
marieBvr Sep 13, 2023
703c807
Add remote_repository_url
marieBvr Sep 13, 2023
61a1b36
Fix python lint which I can't reproduce
marieBvr Sep 13, 2023
b370750
Update tools/virAnnot/virAnnot_blast2tsv.xml
marieBvr Sep 18, 2023
ec00b85
Update tools/virAnnot/virAnnot_blast2tsv.xml
marieBvr Sep 18, 2023
57b7b85
Update tools/virAnnot/virAnnot_otu.xml
marieBvr Sep 18, 2023
471e8f6
Change xml to blastxml format
marieBvr Sep 18, 2023
733e542
Update tools/virAnnot/virAnnot_otu.xml
marieBvr Sep 18, 2023
7910443
Add suite info
marieBvr Sep 18, 2023
8583500
Simplify citation
marieBvr Sep 18, 2023
174ff0c
Remove unnecessary files
marieBvr Sep 18, 2023
9800db6
Fix .shed synthaxe
marieBvr Sep 18, 2023
e2187e9
Fix .shed synthaxe
marieBvr Sep 19, 2023
677b971
reduce size of test data
marieBvr Sep 19, 2023
8837e3a
Fix first test
marieBvr Sep 19, 2023
4084428
Update tools/virAnnot/virAnnot_tsv2krona.xml
marieBvr Sep 20, 2023
c9b813c
Add useful help info
marieBvr Sep 20, 2023
438d351
Add useful help info for blast2tsv
marieBvr Sep 20, 2023
c98b4e7
Add useful help info for rps2tsv
marieBvr Sep 20, 2023
fb0362f
Add useful help info for tsv2krona
marieBvr Sep 20, 2023
457cdee
Fix missing single quote
marieBvr Sep 20, 2023
75806f3
Remove single quote from for loop
marieBvr Sep 20, 2023
a6930c9
Change purpose of script: generate abundance files to be used by exis…
marieBvr Oct 3, 2023
ac0291b
Remove krona test unnecessary data
marieBvr Oct 3, 2023
4043f75
Update rps2tsv to use Interpro instead of pfam because pfam wbesite i…
marieBvr Oct 25, 2023
2b869d8
Adapt blast2tsv XML to abundance files
marieBvr Oct 26, 2023
8c0a152
Update blast2tsv test data
marieBvr Oct 26, 2023
6ab733d
Remove blast2tsv old test data
marieBvr Oct 26, 2023
eb88610
Fix flake8 messages which I can't reproduce
marieBvr Oct 26, 2023
5e9a648
Fix tool output definition equal to input name
marieBvr Nov 7, 2023
c4efd6b
Fix tool output name
marieBvr Nov 8, 2023
253f3b6
Update data for tests
marieBvr Nov 8, 2023
f0ae86d
Change input select to integer
marieBvr Nov 8, 2023
cc8763c
Fix typo and quotes
marieBvr Nov 8, 2023
f978a49
Fix fasta sequence KeyError
marieBvr Nov 8, 2023
b70f952
Fix renaming of data test files for test
marieBvr Nov 8, 2023
9899fe0
Specific requirements does not work
marieBvr Jan 29, 2024
ae77fdf
Fix requirements call
marieBvr Jan 29, 2024
081eabf
this combination should work
bgruening Jan 29, 2024
d56c013
Add missing requirement openpyxl
marieBvr Jan 29, 2024
e8f3bbf
Try fixing lintr error messages
marieBvr Feb 14, 2024
b4d0458
Try fixing lintr error messages indent
marieBvr Feb 14, 2024
17b8186
Update test data for rps2tsv
marieBvr Feb 16, 2024
037685a
Try fixing R indentation
marieBvr Feb 16, 2024
8c83ece
Try fixing R lint
marieBvr Feb 16, 2024
70f11df
misc changes
bgruening Feb 26, 2024
1a7441a
add zip as dependency
bgruening Feb 26, 2024
98797d8
add it to macro to save one container
bgruening Feb 26, 2024
0c7ef58
lint fixes
bgruening Feb 26, 2024
24075a4
Merge fixing
marieBvr Feb 29, 2024
d518f56
Clean test-data
marieBvr Feb 29, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions tools/virAnnot/.shed.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
categories:
marieBvr marked this conversation as resolved.
Show resolved Hide resolved
- Metagenomics
description: virAnnot wrappers
homepage_url: https://github.com/marieBvr/virAnnot
type: unrestricted
long_description: VirAnnot was build to ease the assembly, blast search, taxonomic
annotation and OTUs assignation of viral metagenomic NGS data. It is designed to
identify viruses in plant metagenomic data but it can be used to assemble and annotate
any sequences with the NCBI taxonomy.
name: virannot
owner: melefebvre
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/virAnnot
auto_tool_repositories:
name_template: "{{ tool_id }}"
description_template: "Wrapper for the anndata tool suite: {{ tool_name }}"
suite:
name: "suite_virannot"
description: "Add taxonomic info to HTS data and assign viral OTUs"
long_description: "VirAnnot was build to ease the assembly, blast search, taxonomic
annotation and OTUs assignation of viral metagenomic NGS data. It is designed to
identify viruses in plant metagenomic data but it can be used to assemble and annotate
any sequences with the NCBI taxonomy."
284 changes: 284 additions & 0 deletions tools/virAnnot/blast2tsv.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@marieBvr marieBvr Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bgruening,
Thanks for your suggestion. I had look at this tool before implementing the virannot one.
Even if there is the annotation, it is missing the taxonomic info like Viruses;Riboviria;Orthornavirae;Kitrinoviricota;Alsuviricetes;Martellivirales;Bromoviridae;Ilarvirus;Blackberry chlorotic ringspot virus.
In the near future I plan to only use the annotation for virAnnot, but that will be for the next version.



# Name: blast2tsv
# Author(s): Sebastien Theil, Marie Lefebvre - INRAE
# Aims: Convert blast xml output to tsv and add taxonomy


import argparse
import csv
import logging as log
import os

from Bio import Entrez
from Bio import SeqIO
from Bio.Blast import NCBIXML
from ete3 import NCBITaxa

ncbi = NCBITaxa()


def main():
options = _set_options()
_set_log_level(options.verbosity)
hits = _read_xml(options)
_write_tsv(options, hits)


def _guess_database(accession):
"""Guess the correct database for querying based off the format of the accession"""
database_mappings_refseq = {'AC_': 'nuccore', 'NC_': 'nuccore', 'NG_': 'nuccore',
'NT_': 'nuccore', 'NW_': 'nuccore', 'NZ_': 'nuccore',
'AP_': 'protein', 'NP_': 'protein', 'YP_': 'protein',
'XP_': 'protein', 'WP_': 'protein'}
return database_mappings_refseq[accession[0:3]]


def _read_xml(options):
"""
Parse XML blast results file
Keep only the first hit
"""
log.info("Read XML file.")
results = open(options.xml_file, 'r')
records = NCBIXML.parse(results)
xml_results = {}
for blast_record in records:
for aln in blast_record.alignments:
hit_count = 1
for hit in aln.hsps:
hsp = {}
if hit_count == 1:
first_hit_frame = hit.frame[1] if len(hit.frame) > 0 else 0 # strand
cumul_hit_identity = hit.identities if hit.identities else 0
cumul_hit_score = hit.bits # hit score
cumul_hit_evalue = hit.expect # evalue
cumul_hit_length = hit.align_length if hit.align_length is not None else 0
hit_count = hit_count + 1
else:
# all HSPs in different strand than 1st HSPs will be discarded.
if (first_hit_frame > 0 and hit.frame[1] > 0) or (first_hit_frame < 0 and hit.frame[1] < 0):
cumul_hit_identity = cumul_hit_identity + hit.identities
cumul_hit_length = cumul_hit_length + hit.align_length
cumul_hit_evalue = cumul_hit_evalue + hit.expect
cumul_hit_score = cumul_hit_score + hit.bits
hit_count = hit_count + 1
if hit_count == 1:
final_hit_count = hit_count
elif hit_count > 1:
final_hit_count = hit_count - 1
hsp["evalue"] = cumul_hit_evalue / final_hit_count # The smaller the E-value, the better the match
hsp["query_id"] = blast_record.query_id
hsp["query_length"] = blast_record.query_length # length of the query
hsp["accession"] = aln.accession.replace("ref|", "")
hsp["description"] = aln.hit_def
hsp["hit_length"] = aln.length # length of the hit
hsp["hsp_length"] = hit.align_length # length of the hsp alignment
hsp["queryOverlap"] = _get_overlap_value(options.algo, hsp, 'hsp', hsp["query_length"])[0]
if cumul_hit_length == 0:
hsp["percentIdentity"] = round(cumul_hit_identity, 1) # identity percentage
else:
hsp["percentIdentity"] = round(cumul_hit_identity / cumul_hit_length * 100, 1) # identity percentage
hsp["score"] = cumul_hit_score # The higher the bit-score, the better the sequence similarity
hsp["num_hsps"] = final_hit_count
hsp["hit_cumul_length"] = cumul_hit_length
hsp["hitOverlap"] = _get_overlap_value(options.algo, hsp, 'hit', hsp["query_length"])[1]
db = _guess_database(hsp["accession"])
try:
handle = Entrez.esummary(db=db, id=hsp["accession"])
taxid = str(int(Entrez.read(handle)[0]['TaxId']))
handle.close()
log.info("Taxid found for " + hsp["accession"])
lineage = ncbi.get_lineage(taxid)
names = ncbi.get_taxid_translator(lineage)
ordered = [names[tid] for tid in lineage]
taxonomy = ordered[1:]
hsp["tax_id"] = taxid
hsp["taxonomy"] = ';'.join(taxonomy)
hsp["organism"] = taxonomy[-1]
except RuntimeError:
hsp["tax_id"] = ""
hsp["taxonomy"] = ""
hsp["organism"] = ""
log.warning("RuntimeError - Taxid not found for " + hsp["accession"])
if hsp["evalue"] <= options.max_evalue and hsp["queryOverlap"] >= options.min_qov and \
hsp["hitOverlap"] >= options.min_hov and hsp["score"] >= options.min_score:
xml_results[hsp["query_id"]] = hsp
else:
xml_results[hsp["query_id"]] = [hsp["query_length"]]

return xml_results


def _get_overlap_value(algo, hsp, type, qlength):
"""
Set hsp or hit overlap values for hit and query
Return array [query_overlap, hit_overlap]
"""
if type == 'hsp':
q_align_len = qlength
h_align_len = hsp["hsp_length"]
else:
q_align_len = qlength
h_align_len = hsp["hit_cumul_length"]

if algo == 'BLASTX':
if q_align_len:
query_overlap = (q_align_len * 3 / q_align_len) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len / hsp["hit_length"]) * 100
elif algo == 'TBLASTN':
if q_align_len:
query_overlap = (q_align_len / q_align_len) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len * 3 / hsp["hit_length"]) * 100
elif algo == 'TBLASTX':
if q_align_len:
query_overlap = (q_align_len * 3 / hsp["hsp_length"]) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len * 3 / hsp["hit_length"]) * 100
else:
if q_align_len:
query_overlap = (q_align_len / q_align_len) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len / hsp["hit_length"]) * 100
if query_overlap is None:
query_overlap = 0
if query_overlap > 100:
query_overlap = 100
if 'hit_overlap' not in locals():
hit_overlap = 0
if hit_overlap > 100:
hit_overlap = 100

return [round(query_overlap, 0), round(hit_overlap, 0)]


def _write_tsv(options, hits):
"""
Write output
"""
# get a list of contig without corresponding number of mapped reads
if options.rn_file is not None:
with open(options.rn_file) as rn:
rows = (line.split('\t') for line in rn)
rn_list = {row[0]: row[1:] for row in rows}
fasta = SeqIO.to_dict(SeqIO.parse(open(options.fasta_file), 'fasta'))
headers = "#algo\tquery_id\tnb_reads\tquery_length\taccession\tdescription\torganism\tpercentIdentity\tnb_hsps\tqueryOverlap\thitOverlap\tevalue\tscore\ttax_id\ttaxonomy\tsequence\n"
if not os.path.exists(options.output):
os.mkdir(options.output)
tsv_file = options.output + "/blast2tsv_output.tab"
log.info("Write output file: " + tsv_file)
f = open(tsv_file, "w+")
f.write(headers)
for h in hits:
if options.rn_file is not None:
read_nb = ''.join(rn_list[h]).replace("\n", "")
else:
read_nb = ''
if len(hits[h]) > 1:
f.write(options.algo + "\t" + h + "\t" + read_nb + "\t" + str(hits[h]["query_length"]) + "\t")
f.write(hits[h]["accession"] + "\t" + hits[h]["description"] + "\t")
f.write(hits[h]["organism"] + "\t" + str(hits[h]["percentIdentity"]) + "\t")
f.write(str(hits[h]["num_hsps"]) + "\t" + str(hits[h]["queryOverlap"]) + "\t")
f.write(str(hits[h]["hitOverlap"]) + "\t" + str(hits[h]["evalue"]) + "\t")
f.write(str(hits[h]["score"]) + "\t" + str(hits[h]["tax_id"]) + "\t")
if h in fasta:
f.write(hits[h]["taxonomy"] + "\t" + str(fasta[h].seq))
else:
f.write(hits[h]["taxonomy"] + "\t\"\"")
f.write("\n")
else:
f.write(options.algo + "\t" + h + "\t" + read_nb + "\t" + str(hits[h])[1:-1] + "\t")
f.write("\n")
f.close()
_create_abundance(options, tsv_file)


def _create_abundance(options, tsv_file):
"""
extract values from tsv files
and create abundance files
"""
log.info("Calculating abundance.")
file_path = tsv_file
abundance = dict()
with open(tsv_file, 'r') as current_file:
log.debug("Reading " + file_path)
csv_reader = csv.reader(current_file, delimiter='\t')
line_count = 0
for row in csv_reader:
if line_count == 0:
# headers
line_count += 1
else:
# no annotation
if len(row) == 16:
if row[14] != "":
nb_reads = row[2]
if nb_reads == "":
current_reads_nb = 0
log.debug("No reads number for " + row[1])
else:
current_reads_nb = int(nb_reads)
contig_id = row[14]
if contig_id in abundance:
# add reads
abundance[contig_id]["reads_nb"] = abundance[row[14]]["reads_nb"] + current_reads_nb
abundance[contig_id]["contigs_nb"] = abundance[row[14]]["contigs_nb"] + 1
else:
# init reads for this taxo
abundance[contig_id] = {}
abundance[contig_id]["reads_nb"] = current_reads_nb
abundance[contig_id]["contigs_nb"] = 1
else:
log.debug("No annotations for contig " + row[1])
else:
log.debug("No annotations for contig " + row[1])
log.debug(abundance)
reads_file = open(options.output + "/blast2tsv_reads.txt", "w+")
for taxo in abundance:
reads_file.write(str(abundance[taxo]["reads_nb"]))
reads_file.write("\t")
reads_file.write("\t".join(taxo.split(";")))
reads_file.write("\n")
reads_file.close()
log.info("Abundance file created " + options.output + "/blast2tsv_reads.txt")
contigs_file = open(options.output + "/blast2tsv_contigs.txt", "w+")
for taxo in abundance:
contigs_file.write(str(abundance[taxo]["contigs_nb"]))
contigs_file.write("\t")
contigs_file.write("\t".join(taxo.split(";")))
contigs_file.write("\n")
contigs_file.close()
log.info("Abundance file created " + options.output + "/blast2tsv_contigs.txt")


def _set_options():
parser = argparse.ArgumentParser()
parser.add_argument('-x', '--xml', help='XML files with results of blast', action='store', required=True, dest='xml_file')
parser.add_argument('-rn', '--read-count', help='Tab-delimited file associating seqID with read number.', action='store', dest='rn_file')
parser.add_argument('-c', '--contigs', help='FASTA file with contigs sequence.', action='store', required=True, dest='fasta_file')
parser.add_argument('-me', '--max_evalue', help='Max evalue', action='store', type=float, default=0.0001, dest='max_evalue')
parser.add_argument('-qov', '--min_query_overlap', help='Minimum query overlap', action='store', type=int, default=5, dest='min_qov')
parser.add_argument('-mhov', '--min_hit_overlap', help='Minimum hit overlap', action='store', type=int, default=5, dest='min_hov')
parser.add_argument('-s', '--min_score', help='Minimum score', action='store', type=int, default=30, dest='min_score')
parser.add_argument('-a', '--algo', help='Blast type detection (BLASTN|BLASTP|BLASTX|TBLASTX|TBLASTN|DIAMONDX).', action='store', type=str, default='BLASTX', dest='algo')
parser.add_argument('-o', '--out', help='The output file (.csv).', action='store', type=str, default='./blast2tsv', dest='output')
parser.add_argument('-v', '--verbosity', help='Verbose level', action='store', type=int, choices=[1, 2, 3, 4], default=1)
args = parser.parse_args()
return args


def _set_log_level(verbosity):
if verbosity == 1:
log_format = '%(asctime)s %(levelname)-8s %(message)s'
log.basicConfig(level=log.INFO, format=log_format)
elif verbosity == 3:
log_format = '%(filename)s:%(lineno)s - %(asctime)s %(levelname)-8s %(message)s'
log.basicConfig(level=log.DEBUG, format=log_format)


if __name__ == "__main__":
main()
25 changes: 25 additions & 0 deletions tools/virAnnot/macros.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
<macros>
<xml name="requirements">
<requirements>
<requirement type="package" version="1.83">biopython</requirement>
<requirement type="package" version="3.1.3">ete3</requirement>
<requirement type="package" version="1.2.4">clustalo</requirement>
<requirement type="package" version="8.5.0">curl</requirement>
<requirement type="package" version="4.3.2">r-base</requirement>
<requirement type="package" version="23.12.0">pyaml</requirement>
<requirement type="package" version="3.1.2">openpyxl</requirement>
<requirement type="package" version="3.1.9">xlsxwriter</requirement>
<requirement type="package" version="2.0.1">xlrd</requirement>
<requirement type="package" version="2.2.0">pandas</requirement>
<requirement type="package" version="2.8.1">krona</requirement>
<requirement type="package" version="3.0">zip</requirement>
<yield />
</requirements>
</xml>
<token name="@HEADLESS@"><![CDATA[export QT_QPA_PLATFORM='offscreen' &&]]></token>
<xml name="citations">
<citations>
<citation type="doi">10.1094/PBIOMES-07-19-0037-A</citation>
</citations>
</xml>
</macros>
Loading