Skip to content

Commit

Permalink
Add virAnnot tool (#5482)
Browse files Browse the repository at this point in the history
* First commit

* Flake8 cleaning

* Lint R and python files

* Add remote_repository_url

* Fix python lint which I can't reproduce

* Update tools/virAnnot/virAnnot_blast2tsv.xml

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/virAnnot/virAnnot_blast2tsv.xml

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/virAnnot/virAnnot_otu.xml

Co-authored-by: Björn Grüning <[email protected]>

* Change xml to blastxml format

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/virAnnot/virAnnot_otu.xml

Co-authored-by: Björn Grüning <[email protected]>

* Add suite info

* Simplify citation

* Remove unnecessary files

* Fix .shed synthaxe

* Fix .shed synthaxe

* reduce size of test data

* Fix first test

* Update tools/virAnnot/virAnnot_tsv2krona.xml

Co-authored-by: Björn Grüning <[email protected]>

* Add useful help info

* Add useful help info for blast2tsv

* Add useful help info for rps2tsv

* Add useful help info for tsv2krona

* Fix missing single quote

* Remove single quote from for loop

* Change purpose of script: generate abundance files to be used by existing krona

* Remove krona test unnecessary data

* Update rps2tsv to use Interpro instead of pfam because pfam wbesite is decommissioned

* Adapt blast2tsv XML to abundance files

* Update blast2tsv test data

* Remove blast2tsv old test data

* Fix flake8 messages which I can't reproduce

* Fix tool output definition equal to input name

* Fix tool output name

* Update data for tests

* Change input select to integer

* Fix typo and quotes

* Fix fasta sequence KeyError

* Fix renaming of data test files for test

* Specific requirements does not work

* Fix requirements call

* this combination should work

* Add missing requirement openpyxl

* Try fixing lintr error messages

* Try fixing lintr error messages indent

* Update test data for rps2tsv

* Try fixing R indentation

* Try fixing R lint

* First commit

* Flake8 cleaning

* Lint R and python files

* Add remote_repository_url

* Fix python lint which I can't reproduce

* Update tools/virAnnot/virAnnot_blast2tsv.xml

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/virAnnot/virAnnot_blast2tsv.xml

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/virAnnot/virAnnot_otu.xml

Co-authored-by: Björn Grüning <[email protected]>

* Change xml to blastxml format

Co-authored-by: Björn Grüning <[email protected]>

* Update tools/virAnnot/virAnnot_otu.xml

Co-authored-by: Björn Grüning <[email protected]>

* Add suite info

* Simplify citation

* Remove unnecessary files

* Fix .shed synthaxe

* Fix .shed synthaxe

* reduce size of test data

* Fix first test

* Update tools/virAnnot/virAnnot_tsv2krona.xml

Co-authored-by: Björn Grüning <[email protected]>

* Add useful help info

* Add useful help info for blast2tsv

* Add useful help info for rps2tsv

* Add useful help info for tsv2krona

* Fix missing single quote

* Remove single quote from for loop

* Change purpose of script: generate abundance files to be used by existing krona

* Remove krona test unnecessary data

* Update rps2tsv to use Interpro instead of pfam because pfam wbesite is decommissioned

* Adapt blast2tsv XML to abundance files

* Update blast2tsv test data

* Remove blast2tsv old test data

* Fix flake8 messages which I can't reproduce

* Fix tool output definition equal to input name

* Fix tool output name

* Update data for tests

* Change input select to integer

* Fix typo and quotes

* Fix fasta sequence KeyError

* Fix renaming of data test files for test

* Specific requirements does not work

* Fix requirements call

* this combination should work

* Add missing requirement openpyxl

* Try fixing lintr error messages

* Try fixing lintr error messages indent

* Update test data for rps2tsv

* Try fixing R indentation

* Try fixing R lint

* misc changes

* add zip as dependency

* add it to macro to save one container

* lint fixes

* Clean test-data

---------

Co-authored-by: Björn Grüning <[email protected]>
Co-authored-by: Bjoern Gruening <[email protected]>
  • Loading branch information
3 people authored Mar 4, 2024
1 parent 1b6341a commit 3a3b40c
Show file tree
Hide file tree
Showing 27 changed files with 38,026 additions and 0 deletions.
22 changes: 22 additions & 0 deletions tools/virAnnot/.shed.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
categories:
- Metagenomics
description: virAnnot wrappers
homepage_url: https://github.com/marieBvr/virAnnot
type: unrestricted
long_description: VirAnnot was build to ease the assembly, blast search, taxonomic
annotation and OTUs assignation of viral metagenomic NGS data. It is designed to
identify viruses in plant metagenomic data but it can be used to assemble and annotate
any sequences with the NCBI taxonomy.
name: virannot
owner: melefebvre
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/virAnnot
auto_tool_repositories:
name_template: "{{ tool_id }}"
description_template: "Wrapper for the anndata tool suite: {{ tool_name }}"
suite:
name: "suite_virannot"
description: "Add taxonomic info to HTS data and assign viral OTUs"
long_description: "VirAnnot was build to ease the assembly, blast search, taxonomic
annotation and OTUs assignation of viral metagenomic NGS data. It is designed to
identify viruses in plant metagenomic data but it can be used to assemble and annotate
any sequences with the NCBI taxonomy."
284 changes: 284 additions & 0 deletions tools/virAnnot/blast2tsv.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,284 @@
#!/usr/bin/env python3


# Name: blast2tsv
# Author(s): Sebastien Theil, Marie Lefebvre - INRAE
# Aims: Convert blast xml output to tsv and add taxonomy


import argparse
import csv
import logging as log
import os

from Bio import Entrez
from Bio import SeqIO
from Bio.Blast import NCBIXML
from ete3 import NCBITaxa

ncbi = NCBITaxa()


def main():
options = _set_options()
_set_log_level(options.verbosity)
hits = _read_xml(options)
_write_tsv(options, hits)


def _guess_database(accession):
"""Guess the correct database for querying based off the format of the accession"""
database_mappings_refseq = {'AC_': 'nuccore', 'NC_': 'nuccore', 'NG_': 'nuccore',
'NT_': 'nuccore', 'NW_': 'nuccore', 'NZ_': 'nuccore',
'AP_': 'protein', 'NP_': 'protein', 'YP_': 'protein',
'XP_': 'protein', 'WP_': 'protein'}
return database_mappings_refseq[accession[0:3]]


def _read_xml(options):
"""
Parse XML blast results file
Keep only the first hit
"""
log.info("Read XML file.")
results = open(options.xml_file, 'r')
records = NCBIXML.parse(results)
xml_results = {}
for blast_record in records:
for aln in blast_record.alignments:
hit_count = 1
for hit in aln.hsps:
hsp = {}
if hit_count == 1:
first_hit_frame = hit.frame[1] if len(hit.frame) > 0 else 0 # strand
cumul_hit_identity = hit.identities if hit.identities else 0
cumul_hit_score = hit.bits # hit score
cumul_hit_evalue = hit.expect # evalue
cumul_hit_length = hit.align_length if hit.align_length is not None else 0
hit_count = hit_count + 1
else:
# all HSPs in different strand than 1st HSPs will be discarded.
if (first_hit_frame > 0 and hit.frame[1] > 0) or (first_hit_frame < 0 and hit.frame[1] < 0):
cumul_hit_identity = cumul_hit_identity + hit.identities
cumul_hit_length = cumul_hit_length + hit.align_length
cumul_hit_evalue = cumul_hit_evalue + hit.expect
cumul_hit_score = cumul_hit_score + hit.bits
hit_count = hit_count + 1
if hit_count == 1:
final_hit_count = hit_count
elif hit_count > 1:
final_hit_count = hit_count - 1
hsp["evalue"] = cumul_hit_evalue / final_hit_count # The smaller the E-value, the better the match
hsp["query_id"] = blast_record.query_id
hsp["query_length"] = blast_record.query_length # length of the query
hsp["accession"] = aln.accession.replace("ref|", "")
hsp["description"] = aln.hit_def
hsp["hit_length"] = aln.length # length of the hit
hsp["hsp_length"] = hit.align_length # length of the hsp alignment
hsp["queryOverlap"] = _get_overlap_value(options.algo, hsp, 'hsp', hsp["query_length"])[0]
if cumul_hit_length == 0:
hsp["percentIdentity"] = round(cumul_hit_identity, 1) # identity percentage
else:
hsp["percentIdentity"] = round(cumul_hit_identity / cumul_hit_length * 100, 1) # identity percentage
hsp["score"] = cumul_hit_score # The higher the bit-score, the better the sequence similarity
hsp["num_hsps"] = final_hit_count
hsp["hit_cumul_length"] = cumul_hit_length
hsp["hitOverlap"] = _get_overlap_value(options.algo, hsp, 'hit', hsp["query_length"])[1]
db = _guess_database(hsp["accession"])
try:
handle = Entrez.esummary(db=db, id=hsp["accession"])
taxid = str(int(Entrez.read(handle)[0]['TaxId']))
handle.close()
log.info("Taxid found for " + hsp["accession"])
lineage = ncbi.get_lineage(taxid)
names = ncbi.get_taxid_translator(lineage)
ordered = [names[tid] for tid in lineage]
taxonomy = ordered[1:]
hsp["tax_id"] = taxid
hsp["taxonomy"] = ';'.join(taxonomy)
hsp["organism"] = taxonomy[-1]
except RuntimeError:
hsp["tax_id"] = ""
hsp["taxonomy"] = ""
hsp["organism"] = ""
log.warning("RuntimeError - Taxid not found for " + hsp["accession"])
if hsp["evalue"] <= options.max_evalue and hsp["queryOverlap"] >= options.min_qov and \
hsp["hitOverlap"] >= options.min_hov and hsp["score"] >= options.min_score:
xml_results[hsp["query_id"]] = hsp
else:
xml_results[hsp["query_id"]] = [hsp["query_length"]]

return xml_results


def _get_overlap_value(algo, hsp, type, qlength):
"""
Set hsp or hit overlap values for hit and query
Return array [query_overlap, hit_overlap]
"""
if type == 'hsp':
q_align_len = qlength
h_align_len = hsp["hsp_length"]
else:
q_align_len = qlength
h_align_len = hsp["hit_cumul_length"]

if algo == 'BLASTX':
if q_align_len:
query_overlap = (q_align_len * 3 / q_align_len) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len / hsp["hit_length"]) * 100
elif algo == 'TBLASTN':
if q_align_len:
query_overlap = (q_align_len / q_align_len) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len * 3 / hsp["hit_length"]) * 100
elif algo == 'TBLASTX':
if q_align_len:
query_overlap = (q_align_len * 3 / hsp["hsp_length"]) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len * 3 / hsp["hit_length"]) * 100
else:
if q_align_len:
query_overlap = (q_align_len / q_align_len) * 100
if hsp["hit_length"]:
hit_overlap = (h_align_len / hsp["hit_length"]) * 100
if query_overlap is None:
query_overlap = 0
if query_overlap > 100:
query_overlap = 100
if 'hit_overlap' not in locals():
hit_overlap = 0
if hit_overlap > 100:
hit_overlap = 100

return [round(query_overlap, 0), round(hit_overlap, 0)]


def _write_tsv(options, hits):
"""
Write output
"""
# get a list of contig without corresponding number of mapped reads
if options.rn_file is not None:
with open(options.rn_file) as rn:
rows = (line.split('\t') for line in rn)
rn_list = {row[0]: row[1:] for row in rows}
fasta = SeqIO.to_dict(SeqIO.parse(open(options.fasta_file), 'fasta'))
headers = "#algo\tquery_id\tnb_reads\tquery_length\taccession\tdescription\torganism\tpercentIdentity\tnb_hsps\tqueryOverlap\thitOverlap\tevalue\tscore\ttax_id\ttaxonomy\tsequence\n"
if not os.path.exists(options.output):
os.mkdir(options.output)
tsv_file = options.output + "/blast2tsv_output.tab"
log.info("Write output file: " + tsv_file)
f = open(tsv_file, "w+")
f.write(headers)
for h in hits:
if options.rn_file is not None:
read_nb = ''.join(rn_list[h]).replace("\n", "")
else:
read_nb = ''
if len(hits[h]) > 1:
f.write(options.algo + "\t" + h + "\t" + read_nb + "\t" + str(hits[h]["query_length"]) + "\t")
f.write(hits[h]["accession"] + "\t" + hits[h]["description"] + "\t")
f.write(hits[h]["organism"] + "\t" + str(hits[h]["percentIdentity"]) + "\t")
f.write(str(hits[h]["num_hsps"]) + "\t" + str(hits[h]["queryOverlap"]) + "\t")
f.write(str(hits[h]["hitOverlap"]) + "\t" + str(hits[h]["evalue"]) + "\t")
f.write(str(hits[h]["score"]) + "\t" + str(hits[h]["tax_id"]) + "\t")
if h in fasta:
f.write(hits[h]["taxonomy"] + "\t" + str(fasta[h].seq))
else:
f.write(hits[h]["taxonomy"] + "\t\"\"")
f.write("\n")
else:
f.write(options.algo + "\t" + h + "\t" + read_nb + "\t" + str(hits[h])[1:-1] + "\t")
f.write("\n")
f.close()
_create_abundance(options, tsv_file)


def _create_abundance(options, tsv_file):
"""
extract values from tsv files
and create abundance files
"""
log.info("Calculating abundance.")
file_path = tsv_file
abundance = dict()
with open(tsv_file, 'r') as current_file:
log.debug("Reading " + file_path)
csv_reader = csv.reader(current_file, delimiter='\t')
line_count = 0
for row in csv_reader:
if line_count == 0:
# headers
line_count += 1
else:
# no annotation
if len(row) == 16:
if row[14] != "":
nb_reads = row[2]
if nb_reads == "":
current_reads_nb = 0
log.debug("No reads number for " + row[1])
else:
current_reads_nb = int(nb_reads)
contig_id = row[14]
if contig_id in abundance:
# add reads
abundance[contig_id]["reads_nb"] = abundance[row[14]]["reads_nb"] + current_reads_nb
abundance[contig_id]["contigs_nb"] = abundance[row[14]]["contigs_nb"] + 1
else:
# init reads for this taxo
abundance[contig_id] = {}
abundance[contig_id]["reads_nb"] = current_reads_nb
abundance[contig_id]["contigs_nb"] = 1
else:
log.debug("No annotations for contig " + row[1])
else:
log.debug("No annotations for contig " + row[1])
log.debug(abundance)
reads_file = open(options.output + "/blast2tsv_reads.txt", "w+")
for taxo in abundance:
reads_file.write(str(abundance[taxo]["reads_nb"]))
reads_file.write("\t")
reads_file.write("\t".join(taxo.split(";")))
reads_file.write("\n")
reads_file.close()
log.info("Abundance file created " + options.output + "/blast2tsv_reads.txt")
contigs_file = open(options.output + "/blast2tsv_contigs.txt", "w+")
for taxo in abundance:
contigs_file.write(str(abundance[taxo]["contigs_nb"]))
contigs_file.write("\t")
contigs_file.write("\t".join(taxo.split(";")))
contigs_file.write("\n")
contigs_file.close()
log.info("Abundance file created " + options.output + "/blast2tsv_contigs.txt")


def _set_options():
parser = argparse.ArgumentParser()
parser.add_argument('-x', '--xml', help='XML files with results of blast', action='store', required=True, dest='xml_file')
parser.add_argument('-rn', '--read-count', help='Tab-delimited file associating seqID with read number.', action='store', dest='rn_file')
parser.add_argument('-c', '--contigs', help='FASTA file with contigs sequence.', action='store', required=True, dest='fasta_file')
parser.add_argument('-me', '--max_evalue', help='Max evalue', action='store', type=float, default=0.0001, dest='max_evalue')
parser.add_argument('-qov', '--min_query_overlap', help='Minimum query overlap', action='store', type=int, default=5, dest='min_qov')
parser.add_argument('-mhov', '--min_hit_overlap', help='Minimum hit overlap', action='store', type=int, default=5, dest='min_hov')
parser.add_argument('-s', '--min_score', help='Minimum score', action='store', type=int, default=30, dest='min_score')
parser.add_argument('-a', '--algo', help='Blast type detection (BLASTN|BLASTP|BLASTX|TBLASTX|TBLASTN|DIAMONDX).', action='store', type=str, default='BLASTX', dest='algo')
parser.add_argument('-o', '--out', help='The output file (.csv).', action='store', type=str, default='./blast2tsv', dest='output')
parser.add_argument('-v', '--verbosity', help='Verbose level', action='store', type=int, choices=[1, 2, 3, 4], default=1)
args = parser.parse_args()
return args


def _set_log_level(verbosity):
if verbosity == 1:
log_format = '%(asctime)s %(levelname)-8s %(message)s'
log.basicConfig(level=log.INFO, format=log_format)
elif verbosity == 3:
log_format = '%(filename)s:%(lineno)s - %(asctime)s %(levelname)-8s %(message)s'
log.basicConfig(level=log.DEBUG, format=log_format)


if __name__ == "__main__":
main()
25 changes: 25 additions & 0 deletions tools/virAnnot/macros.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
<macros>
<xml name="requirements">
<requirements>
<requirement type="package" version="1.83">biopython</requirement>
<requirement type="package" version="3.1.3">ete3</requirement>
<requirement type="package" version="1.2.4">clustalo</requirement>
<requirement type="package" version="8.5.0">curl</requirement>
<requirement type="package" version="4.3.2">r-base</requirement>
<requirement type="package" version="23.12.0">pyaml</requirement>
<requirement type="package" version="3.1.2">openpyxl</requirement>
<requirement type="package" version="3.1.9">xlsxwriter</requirement>
<requirement type="package" version="2.0.1">xlrd</requirement>
<requirement type="package" version="2.2.0">pandas</requirement>
<requirement type="package" version="2.8.1">krona</requirement>
<requirement type="package" version="3.0">zip</requirement>
<yield />
</requirements>
</xml>
<token name="@HEADLESS@"><![CDATA[export QT_QPA_PLATFORM='offscreen' &&]]></token>
<xml name="citations">
<citations>
<citation type="doi">10.1094/PBIOMES-07-19-0037-A</citation>
</citations>
</xml>
</macros>
Loading

0 comments on commit 3a3b40c

Please sign in to comment.