-
Notifications
You must be signed in to change notification settings - Fork 441
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* First commit * Flake8 cleaning * Lint R and python files * Add remote_repository_url * Fix python lint which I can't reproduce * Update tools/virAnnot/virAnnot_blast2tsv.xml Co-authored-by: Björn Grüning <[email protected]> * Update tools/virAnnot/virAnnot_blast2tsv.xml Co-authored-by: Björn Grüning <[email protected]> * Update tools/virAnnot/virAnnot_otu.xml Co-authored-by: Björn Grüning <[email protected]> * Change xml to blastxml format Co-authored-by: Björn Grüning <[email protected]> * Update tools/virAnnot/virAnnot_otu.xml Co-authored-by: Björn Grüning <[email protected]> * Add suite info * Simplify citation * Remove unnecessary files * Fix .shed synthaxe * Fix .shed synthaxe * reduce size of test data * Fix first test * Update tools/virAnnot/virAnnot_tsv2krona.xml Co-authored-by: Björn Grüning <[email protected]> * Add useful help info * Add useful help info for blast2tsv * Add useful help info for rps2tsv * Add useful help info for tsv2krona * Fix missing single quote * Remove single quote from for loop * Change purpose of script: generate abundance files to be used by existing krona * Remove krona test unnecessary data * Update rps2tsv to use Interpro instead of pfam because pfam wbesite is decommissioned * Adapt blast2tsv XML to abundance files * Update blast2tsv test data * Remove blast2tsv old test data * Fix flake8 messages which I can't reproduce * Fix tool output definition equal to input name * Fix tool output name * Update data for tests * Change input select to integer * Fix typo and quotes * Fix fasta sequence KeyError * Fix renaming of data test files for test * Specific requirements does not work * Fix requirements call * this combination should work * Add missing requirement openpyxl * Try fixing lintr error messages * Try fixing lintr error messages indent * Update test data for rps2tsv * Try fixing R indentation * Try fixing R lint * First commit * Flake8 cleaning * Lint R and python files * Add remote_repository_url * Fix python lint which I can't reproduce * Update tools/virAnnot/virAnnot_blast2tsv.xml Co-authored-by: Björn Grüning <[email protected]> * Update tools/virAnnot/virAnnot_blast2tsv.xml Co-authored-by: Björn Grüning <[email protected]> * Update tools/virAnnot/virAnnot_otu.xml Co-authored-by: Björn Grüning <[email protected]> * Change xml to blastxml format Co-authored-by: Björn Grüning <[email protected]> * Update tools/virAnnot/virAnnot_otu.xml Co-authored-by: Björn Grüning <[email protected]> * Add suite info * Simplify citation * Remove unnecessary files * Fix .shed synthaxe * Fix .shed synthaxe * reduce size of test data * Fix first test * Update tools/virAnnot/virAnnot_tsv2krona.xml Co-authored-by: Björn Grüning <[email protected]> * Add useful help info * Add useful help info for blast2tsv * Add useful help info for rps2tsv * Add useful help info for tsv2krona * Fix missing single quote * Remove single quote from for loop * Change purpose of script: generate abundance files to be used by existing krona * Remove krona test unnecessary data * Update rps2tsv to use Interpro instead of pfam because pfam wbesite is decommissioned * Adapt blast2tsv XML to abundance files * Update blast2tsv test data * Remove blast2tsv old test data * Fix flake8 messages which I can't reproduce * Fix tool output definition equal to input name * Fix tool output name * Update data for tests * Change input select to integer * Fix typo and quotes * Fix fasta sequence KeyError * Fix renaming of data test files for test * Specific requirements does not work * Fix requirements call * this combination should work * Add missing requirement openpyxl * Try fixing lintr error messages * Try fixing lintr error messages indent * Update test data for rps2tsv * Try fixing R indentation * Try fixing R lint * misc changes * add zip as dependency * add it to macro to save one container * lint fixes * Clean test-data --------- Co-authored-by: Björn Grüning <[email protected]> Co-authored-by: Bjoern Gruening <[email protected]>
- Loading branch information
1 parent
1b6341a
commit 3a3b40c
Showing
27 changed files
with
38,026 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
categories: | ||
- Metagenomics | ||
description: virAnnot wrappers | ||
homepage_url: https://github.com/marieBvr/virAnnot | ||
type: unrestricted | ||
long_description: VirAnnot was build to ease the assembly, blast search, taxonomic | ||
annotation and OTUs assignation of viral metagenomic NGS data. It is designed to | ||
identify viruses in plant metagenomic data but it can be used to assemble and annotate | ||
any sequences with the NCBI taxonomy. | ||
name: virannot | ||
owner: melefebvre | ||
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/master/tools/virAnnot | ||
auto_tool_repositories: | ||
name_template: "{{ tool_id }}" | ||
description_template: "Wrapper for the anndata tool suite: {{ tool_name }}" | ||
suite: | ||
name: "suite_virannot" | ||
description: "Add taxonomic info to HTS data and assign viral OTUs" | ||
long_description: "VirAnnot was build to ease the assembly, blast search, taxonomic | ||
annotation and OTUs assignation of viral metagenomic NGS data. It is designed to | ||
identify viruses in plant metagenomic data but it can be used to assemble and annotate | ||
any sequences with the NCBI taxonomy." |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,284 @@ | ||
#!/usr/bin/env python3 | ||
|
||
|
||
# Name: blast2tsv | ||
# Author(s): Sebastien Theil, Marie Lefebvre - INRAE | ||
# Aims: Convert blast xml output to tsv and add taxonomy | ||
|
||
|
||
import argparse | ||
import csv | ||
import logging as log | ||
import os | ||
|
||
from Bio import Entrez | ||
from Bio import SeqIO | ||
from Bio.Blast import NCBIXML | ||
from ete3 import NCBITaxa | ||
|
||
ncbi = NCBITaxa() | ||
|
||
|
||
def main(): | ||
options = _set_options() | ||
_set_log_level(options.verbosity) | ||
hits = _read_xml(options) | ||
_write_tsv(options, hits) | ||
|
||
|
||
def _guess_database(accession): | ||
"""Guess the correct database for querying based off the format of the accession""" | ||
database_mappings_refseq = {'AC_': 'nuccore', 'NC_': 'nuccore', 'NG_': 'nuccore', | ||
'NT_': 'nuccore', 'NW_': 'nuccore', 'NZ_': 'nuccore', | ||
'AP_': 'protein', 'NP_': 'protein', 'YP_': 'protein', | ||
'XP_': 'protein', 'WP_': 'protein'} | ||
return database_mappings_refseq[accession[0:3]] | ||
|
||
|
||
def _read_xml(options): | ||
""" | ||
Parse XML blast results file | ||
Keep only the first hit | ||
""" | ||
log.info("Read XML file.") | ||
results = open(options.xml_file, 'r') | ||
records = NCBIXML.parse(results) | ||
xml_results = {} | ||
for blast_record in records: | ||
for aln in blast_record.alignments: | ||
hit_count = 1 | ||
for hit in aln.hsps: | ||
hsp = {} | ||
if hit_count == 1: | ||
first_hit_frame = hit.frame[1] if len(hit.frame) > 0 else 0 # strand | ||
cumul_hit_identity = hit.identities if hit.identities else 0 | ||
cumul_hit_score = hit.bits # hit score | ||
cumul_hit_evalue = hit.expect # evalue | ||
cumul_hit_length = hit.align_length if hit.align_length is not None else 0 | ||
hit_count = hit_count + 1 | ||
else: | ||
# all HSPs in different strand than 1st HSPs will be discarded. | ||
if (first_hit_frame > 0 and hit.frame[1] > 0) or (first_hit_frame < 0 and hit.frame[1] < 0): | ||
cumul_hit_identity = cumul_hit_identity + hit.identities | ||
cumul_hit_length = cumul_hit_length + hit.align_length | ||
cumul_hit_evalue = cumul_hit_evalue + hit.expect | ||
cumul_hit_score = cumul_hit_score + hit.bits | ||
hit_count = hit_count + 1 | ||
if hit_count == 1: | ||
final_hit_count = hit_count | ||
elif hit_count > 1: | ||
final_hit_count = hit_count - 1 | ||
hsp["evalue"] = cumul_hit_evalue / final_hit_count # The smaller the E-value, the better the match | ||
hsp["query_id"] = blast_record.query_id | ||
hsp["query_length"] = blast_record.query_length # length of the query | ||
hsp["accession"] = aln.accession.replace("ref|", "") | ||
hsp["description"] = aln.hit_def | ||
hsp["hit_length"] = aln.length # length of the hit | ||
hsp["hsp_length"] = hit.align_length # length of the hsp alignment | ||
hsp["queryOverlap"] = _get_overlap_value(options.algo, hsp, 'hsp', hsp["query_length"])[0] | ||
if cumul_hit_length == 0: | ||
hsp["percentIdentity"] = round(cumul_hit_identity, 1) # identity percentage | ||
else: | ||
hsp["percentIdentity"] = round(cumul_hit_identity / cumul_hit_length * 100, 1) # identity percentage | ||
hsp["score"] = cumul_hit_score # The higher the bit-score, the better the sequence similarity | ||
hsp["num_hsps"] = final_hit_count | ||
hsp["hit_cumul_length"] = cumul_hit_length | ||
hsp["hitOverlap"] = _get_overlap_value(options.algo, hsp, 'hit', hsp["query_length"])[1] | ||
db = _guess_database(hsp["accession"]) | ||
try: | ||
handle = Entrez.esummary(db=db, id=hsp["accession"]) | ||
taxid = str(int(Entrez.read(handle)[0]['TaxId'])) | ||
handle.close() | ||
log.info("Taxid found for " + hsp["accession"]) | ||
lineage = ncbi.get_lineage(taxid) | ||
names = ncbi.get_taxid_translator(lineage) | ||
ordered = [names[tid] for tid in lineage] | ||
taxonomy = ordered[1:] | ||
hsp["tax_id"] = taxid | ||
hsp["taxonomy"] = ';'.join(taxonomy) | ||
hsp["organism"] = taxonomy[-1] | ||
except RuntimeError: | ||
hsp["tax_id"] = "" | ||
hsp["taxonomy"] = "" | ||
hsp["organism"] = "" | ||
log.warning("RuntimeError - Taxid not found for " + hsp["accession"]) | ||
if hsp["evalue"] <= options.max_evalue and hsp["queryOverlap"] >= options.min_qov and \ | ||
hsp["hitOverlap"] >= options.min_hov and hsp["score"] >= options.min_score: | ||
xml_results[hsp["query_id"]] = hsp | ||
else: | ||
xml_results[hsp["query_id"]] = [hsp["query_length"]] | ||
|
||
return xml_results | ||
|
||
|
||
def _get_overlap_value(algo, hsp, type, qlength): | ||
""" | ||
Set hsp or hit overlap values for hit and query | ||
Return array [query_overlap, hit_overlap] | ||
""" | ||
if type == 'hsp': | ||
q_align_len = qlength | ||
h_align_len = hsp["hsp_length"] | ||
else: | ||
q_align_len = qlength | ||
h_align_len = hsp["hit_cumul_length"] | ||
|
||
if algo == 'BLASTX': | ||
if q_align_len: | ||
query_overlap = (q_align_len * 3 / q_align_len) * 100 | ||
if hsp["hit_length"]: | ||
hit_overlap = (h_align_len / hsp["hit_length"]) * 100 | ||
elif algo == 'TBLASTN': | ||
if q_align_len: | ||
query_overlap = (q_align_len / q_align_len) * 100 | ||
if hsp["hit_length"]: | ||
hit_overlap = (h_align_len * 3 / hsp["hit_length"]) * 100 | ||
elif algo == 'TBLASTX': | ||
if q_align_len: | ||
query_overlap = (q_align_len * 3 / hsp["hsp_length"]) * 100 | ||
if hsp["hit_length"]: | ||
hit_overlap = (h_align_len * 3 / hsp["hit_length"]) * 100 | ||
else: | ||
if q_align_len: | ||
query_overlap = (q_align_len / q_align_len) * 100 | ||
if hsp["hit_length"]: | ||
hit_overlap = (h_align_len / hsp["hit_length"]) * 100 | ||
if query_overlap is None: | ||
query_overlap = 0 | ||
if query_overlap > 100: | ||
query_overlap = 100 | ||
if 'hit_overlap' not in locals(): | ||
hit_overlap = 0 | ||
if hit_overlap > 100: | ||
hit_overlap = 100 | ||
|
||
return [round(query_overlap, 0), round(hit_overlap, 0)] | ||
|
||
|
||
def _write_tsv(options, hits): | ||
""" | ||
Write output | ||
""" | ||
# get a list of contig without corresponding number of mapped reads | ||
if options.rn_file is not None: | ||
with open(options.rn_file) as rn: | ||
rows = (line.split('\t') for line in rn) | ||
rn_list = {row[0]: row[1:] for row in rows} | ||
fasta = SeqIO.to_dict(SeqIO.parse(open(options.fasta_file), 'fasta')) | ||
headers = "#algo\tquery_id\tnb_reads\tquery_length\taccession\tdescription\torganism\tpercentIdentity\tnb_hsps\tqueryOverlap\thitOverlap\tevalue\tscore\ttax_id\ttaxonomy\tsequence\n" | ||
if not os.path.exists(options.output): | ||
os.mkdir(options.output) | ||
tsv_file = options.output + "/blast2tsv_output.tab" | ||
log.info("Write output file: " + tsv_file) | ||
f = open(tsv_file, "w+") | ||
f.write(headers) | ||
for h in hits: | ||
if options.rn_file is not None: | ||
read_nb = ''.join(rn_list[h]).replace("\n", "") | ||
else: | ||
read_nb = '' | ||
if len(hits[h]) > 1: | ||
f.write(options.algo + "\t" + h + "\t" + read_nb + "\t" + str(hits[h]["query_length"]) + "\t") | ||
f.write(hits[h]["accession"] + "\t" + hits[h]["description"] + "\t") | ||
f.write(hits[h]["organism"] + "\t" + str(hits[h]["percentIdentity"]) + "\t") | ||
f.write(str(hits[h]["num_hsps"]) + "\t" + str(hits[h]["queryOverlap"]) + "\t") | ||
f.write(str(hits[h]["hitOverlap"]) + "\t" + str(hits[h]["evalue"]) + "\t") | ||
f.write(str(hits[h]["score"]) + "\t" + str(hits[h]["tax_id"]) + "\t") | ||
if h in fasta: | ||
f.write(hits[h]["taxonomy"] + "\t" + str(fasta[h].seq)) | ||
else: | ||
f.write(hits[h]["taxonomy"] + "\t\"\"") | ||
f.write("\n") | ||
else: | ||
f.write(options.algo + "\t" + h + "\t" + read_nb + "\t" + str(hits[h])[1:-1] + "\t") | ||
f.write("\n") | ||
f.close() | ||
_create_abundance(options, tsv_file) | ||
|
||
|
||
def _create_abundance(options, tsv_file): | ||
""" | ||
extract values from tsv files | ||
and create abundance files | ||
""" | ||
log.info("Calculating abundance.") | ||
file_path = tsv_file | ||
abundance = dict() | ||
with open(tsv_file, 'r') as current_file: | ||
log.debug("Reading " + file_path) | ||
csv_reader = csv.reader(current_file, delimiter='\t') | ||
line_count = 0 | ||
for row in csv_reader: | ||
if line_count == 0: | ||
# headers | ||
line_count += 1 | ||
else: | ||
# no annotation | ||
if len(row) == 16: | ||
if row[14] != "": | ||
nb_reads = row[2] | ||
if nb_reads == "": | ||
current_reads_nb = 0 | ||
log.debug("No reads number for " + row[1]) | ||
else: | ||
current_reads_nb = int(nb_reads) | ||
contig_id = row[14] | ||
if contig_id in abundance: | ||
# add reads | ||
abundance[contig_id]["reads_nb"] = abundance[row[14]]["reads_nb"] + current_reads_nb | ||
abundance[contig_id]["contigs_nb"] = abundance[row[14]]["contigs_nb"] + 1 | ||
else: | ||
# init reads for this taxo | ||
abundance[contig_id] = {} | ||
abundance[contig_id]["reads_nb"] = current_reads_nb | ||
abundance[contig_id]["contigs_nb"] = 1 | ||
else: | ||
log.debug("No annotations for contig " + row[1]) | ||
else: | ||
log.debug("No annotations for contig " + row[1]) | ||
log.debug(abundance) | ||
reads_file = open(options.output + "/blast2tsv_reads.txt", "w+") | ||
for taxo in abundance: | ||
reads_file.write(str(abundance[taxo]["reads_nb"])) | ||
reads_file.write("\t") | ||
reads_file.write("\t".join(taxo.split(";"))) | ||
reads_file.write("\n") | ||
reads_file.close() | ||
log.info("Abundance file created " + options.output + "/blast2tsv_reads.txt") | ||
contigs_file = open(options.output + "/blast2tsv_contigs.txt", "w+") | ||
for taxo in abundance: | ||
contigs_file.write(str(abundance[taxo]["contigs_nb"])) | ||
contigs_file.write("\t") | ||
contigs_file.write("\t".join(taxo.split(";"))) | ||
contigs_file.write("\n") | ||
contigs_file.close() | ||
log.info("Abundance file created " + options.output + "/blast2tsv_contigs.txt") | ||
|
||
|
||
def _set_options(): | ||
parser = argparse.ArgumentParser() | ||
parser.add_argument('-x', '--xml', help='XML files with results of blast', action='store', required=True, dest='xml_file') | ||
parser.add_argument('-rn', '--read-count', help='Tab-delimited file associating seqID with read number.', action='store', dest='rn_file') | ||
parser.add_argument('-c', '--contigs', help='FASTA file with contigs sequence.', action='store', required=True, dest='fasta_file') | ||
parser.add_argument('-me', '--max_evalue', help='Max evalue', action='store', type=float, default=0.0001, dest='max_evalue') | ||
parser.add_argument('-qov', '--min_query_overlap', help='Minimum query overlap', action='store', type=int, default=5, dest='min_qov') | ||
parser.add_argument('-mhov', '--min_hit_overlap', help='Minimum hit overlap', action='store', type=int, default=5, dest='min_hov') | ||
parser.add_argument('-s', '--min_score', help='Minimum score', action='store', type=int, default=30, dest='min_score') | ||
parser.add_argument('-a', '--algo', help='Blast type detection (BLASTN|BLASTP|BLASTX|TBLASTX|TBLASTN|DIAMONDX).', action='store', type=str, default='BLASTX', dest='algo') | ||
parser.add_argument('-o', '--out', help='The output file (.csv).', action='store', type=str, default='./blast2tsv', dest='output') | ||
parser.add_argument('-v', '--verbosity', help='Verbose level', action='store', type=int, choices=[1, 2, 3, 4], default=1) | ||
args = parser.parse_args() | ||
return args | ||
|
||
|
||
def _set_log_level(verbosity): | ||
if verbosity == 1: | ||
log_format = '%(asctime)s %(levelname)-8s %(message)s' | ||
log.basicConfig(level=log.INFO, format=log_format) | ||
elif verbosity == 3: | ||
log_format = '%(filename)s:%(lineno)s - %(asctime)s %(levelname)-8s %(message)s' | ||
log.basicConfig(level=log.DEBUG, format=log_format) | ||
|
||
|
||
if __name__ == "__main__": | ||
main() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
<macros> | ||
<xml name="requirements"> | ||
<requirements> | ||
<requirement type="package" version="1.83">biopython</requirement> | ||
<requirement type="package" version="3.1.3">ete3</requirement> | ||
<requirement type="package" version="1.2.4">clustalo</requirement> | ||
<requirement type="package" version="8.5.0">curl</requirement> | ||
<requirement type="package" version="4.3.2">r-base</requirement> | ||
<requirement type="package" version="23.12.0">pyaml</requirement> | ||
<requirement type="package" version="3.1.2">openpyxl</requirement> | ||
<requirement type="package" version="3.1.9">xlsxwriter</requirement> | ||
<requirement type="package" version="2.0.1">xlrd</requirement> | ||
<requirement type="package" version="2.2.0">pandas</requirement> | ||
<requirement type="package" version="2.8.1">krona</requirement> | ||
<requirement type="package" version="3.0">zip</requirement> | ||
<yield /> | ||
</requirements> | ||
</xml> | ||
<token name="@HEADLESS@"><![CDATA[export QT_QPA_PLATFORM='offscreen' &&]]></token> | ||
<xml name="citations"> | ||
<citations> | ||
<citation type="doi">10.1094/PBIOMES-07-19-0037-A</citation> | ||
</citations> | ||
</xml> | ||
</macros> |
Oops, something went wrong.