ToxCodAn-Genome

ToxCodAn-Genome is a computational tool designed to annotate toxin genes in genomes of venomous lineages.

The guide for toxin annotation in genomes is here.

Getting Started

Installation

Clone the ToxCodAn-Genome respository and add the bin folder into your PATH:

git clone https://github.com/pedronachtigall/ToxCodAn-Genome.git
export PATH=$PATH:$PWD/ToxCodAn-Genome/bin/

echo "export PATH=$PATH:$PWD/ToxCodAn-Genome/bin/" >> ~/.bashrc to add it permanently.
- Replace ~/.bashrc to ~/.bash_profile if needed.
Apply "execution permission" to executables if needed: chmod +x $PWD/ToxCodAn-Genome/bin/*

Requirements

Python, biopython, and pandas
NCBI-BLAST (v2.9.0 or above)
Exonerate
Miniprot
GffRead
Hisat2 - Optional (used in Transcriptome assembly)
Samtools - Optional (used in Transcriptome assembly)
StringTie - Optional (used in Transcriptome assembly)
Trinity - Optional (used in Transcriptome assembly)
SPAdes - Optional (used in Transcriptome assembly)

Ensure that all requirements are working properly.

⚠️ If the user wants to install ToxCodAn-Genome and all dependencies using Conda environment, follow the steps below:

Tip: First, ensure that you have added all conda channels properly in the following order:
- conda config --add channels defaults
- conda config --add channels bioconda
- conda config --add channels conda-forge
Create the environment:
- conda create -n ToxcodanGenome -c bioconda python biopython pandas blast exonerate miniprot gffread hisat2 samtools stringtie trinity spades
Git clone the ToxCodAn-Genome repository and add the bin to your PATH:
- git clone https://github.com/pedronachtigall/ToxCodAn-Genome.git
- echo "export PATH=$PATH:$PWD/ToxCodAn-Genome/bin/" >> ~/.bashrc
  - Replace ~/.bashrc to ~/.bash_profile if needed.
It may be needed to apply "execution permission" to all bin executables in "ToxCodAn-Genome/bin/":
- chmod +x ToxCodAn-Genome/bin/*
Then, run ToxCodAn-Genome as described in the "Usage" section.
To activate the environment to run ToxCodAn-Genome just use the command: conda activate ToxcodanGenome
To deactivate the environment just use the command: conda deactivate

Toxin Database

We have designed toxin databases for some venomous lineages that were tested and working well.

The toxin databases with a brief descriptions are available in the "Databases" folder.

You can check the guide to learn how to design a custom toxin database (when venom tissue transcriptomic data is available or not).

The toxin database must be set with the -d option (see below).

Usage

Usage: toxcodan-genome.py [options]

Options:
  -h, --help            show this help message and exit
  -g fasta, --genome=fasta
                        Mandatory - genome sequence in FASTA format,
                        /path/to/genome.fasta
  -d fasta, --database=fasta
                        Mandatory - database with coding sequences (CDSs) of
                        toxins in FASTA format, /path/to/cds.fasta
  -C fasta, --cds=fasta
                        Optional - toxin coding sequences (CDSs) of the
                        individual/species previously annotated from de-
                        novo/genome-guided assembly in FASTA format,
                        /path/to/toxin_cds.fasta
  -t fasta, --transcripts=fasta
                        Optional - transcripts recovered from venom tissue
                        RNA-seq data for the individual/species using de-
                        novo/genome-guided assembly methods in FASTA format,
                        /path/to/transcripts.fasta
  -r fastq(.gz), --reads=fastq(.gz)
                        Optional - pre-processed reads (i.e., adapters trimmed
                        and low-quality reads removed) obtained from the toxin
                        tissue of the species in FASTQ(.GZ) format. If single-
                        end (or merged reads), specify only one file (e.g.,
                        path/to/reads.fastq). If paired-end, specify both
                        files in a comma-separated format (e.g.,
                        path/to/reads_1.fastq,path/to/reads_2.fastq). If you
                        also set transcripts file in the "-t" parameter, this
                        parameter/file will be ignored.
  -u fasta, --uniprot=fasta
                        Optional - path to uniprot/toxprot database to perform
                        an extra step of similarity search to compare the
                        annotated toxins to the uniprot/toxprot database. It
                        will output a report file in TXT format containing the
                        annotated toxin and the best hit in the
                        uniprot/toxprot database, alongside their percent
                        identity, alignment length, toxin length, and
                        uniprot/toxprot entry length.
  -o folder, --output=folder
                        Optional - output folder, /path/to/output_folder; if
                        not defined, the output folder will be set in the
                        current directory with the following name
                        [ToxCodAnGenome_output]
  -p int, --percid=int  Optional - threshold value used as the minimum percent
                        identity between match CDSs and genome [default=80]
  -s int, --mingenesize=int
                        Optional - threshold value used as the minimum size of
                        a gene [default=400]
  -S int, --maxgenesize=int
                        Optional - threshold value used as the maximum size of
                        a gene [default=50000]
  -l int, --length=int  Optional - minimum size of a CDS; it will remove any
                        annotated CDS shorter than the specified threshold
                        [default=200]
  -k boolean value, --keeptemp=boolean value
                        Optional - keep temporary files. Use True to keep all
                        temporary files or False to remove them
                        [default=False]
  -c int, --cpu=int     Optional - number of threads to be used in each step
                        [default=1]

Basic usage:

toxcodan-genome.py -g genome.fasta -d toxin_database.fasta

Check our tutorial to learn how to use ToxCodAn-Genome.

Check our guide to have full details about running ToxCodAn-Genome and how to perform toxin annotation in genomes.

Inputs

ToxCodAn-Genome has the following inputs as mandatory:

The genome in FASTA format through the -g option.
The toxin database in FASTA format through the -d option.

Outputs

By default, ToxCodAn-Genome outputs the following files:

ToxCodAnGenome_output/
├── annotation_removed.txt
├── annotation_warning.txt
├── matched_GTFs/
│   ├── contig_1--1234-5678.gtf
│   ├── contig_2--11234-15678.gtf
│   ├── ...
│   └── contig_N--111234-115678.gtf
├── matched_regions.gtf
├── toxin_annotation_cds.fasta
├── toxin_annotation.gtf
└── toxin_annotation_pep.fasta

Description of the output files:

toxin_annotation -> final toxin annotation files (including a gtf and two fasta files of CDSs and peptides)
annotation_warning.txt -> list of annotations in the final toxin annotation file that need manual inspection (may represent truncated paralogs, pseudogenes, and/or erroneous annotations)
annotation_removed.txt -> annotations that were removed from the final toxin annotation file (may represent erroneous/incomplete annotations)
matched_regions.gtf -> regions of genome matching to full-length toxin CDSs in the database (that returned or not a toxin annotation)
matched_GTFs/ -> folder containing all matched regions that returned a toxin annotation (each file is named by the genome position as follows "contig--start-end")

If you want to keep all temporary files, run ToxCodAn-Genome with the parameter -k True.

Citation

If you use or discuss ToxCodAn-Genome, its guide, or any script available at this repository, please cite:

Nachtigall et al. (2024) ToxCodAn-Genome: an automated pipeline for toxin-gene annotation in genome assembly of venomous lineages. Giga Science. DOI: https://doi.org/10.1093/gigascience/giad116

License

GNU GPLv3

Contact

🐛🆘💬

To report bugs, to ask for help and to give any feedback, please contact Pedro G. Nachtigall: [email protected]

Frequently Asked Questions (FAQ)

[Q1] What Operation System (OS) do I need to use ToxCodAn-Genome?

We tested ToxCodAn-Genome in Linux Ubuntu 16, 18 and 20. However, we believe that ToxCodAn-Genome should work on any UNIX OS able to have all dependencies installed.

[Q2] How long will take to ToxCodAn-Genome finish the analysis?

We tested ToxCodAn-Genome using a personal computer (6-Core i7 with 16Gb memory) and 6 threads (-c 6). It took less than 2 minutes to finish the analysis using a genome with an average size of 1.6Gb. If the user has more threads available for use, the running time will decrease.

[Q3] ToxCodAn-Genome is returning an error in the "generating final output" step similar to subprocess.CalledProcessError and Segmentation fault (core dumped). What should I do?

This error can be caused by one or more lines containing a huge sequence. Some tools and packages, like Bio::DB::Fasta used by GffRead, can't process a fasta file with lines containing more than 65,536 characters. So, if you have any large sequence in one unique line, do the following:
- download the script BreakLines.py to "break" the genomic sequences into 100 nts per line
- wget https://raw.githubusercontent.com/pedronachtigall/CodAn/master/scripts/BreakLines.py
- run BreakLines script: python BreakLines.py genome.fasta genome_breaklines.fasta
- use the "genome_breaklines.fasta" to run ToxCodAn-Genome. It will solve this issue.

Name		Name	Last commit message	Last commit date
Latest commit History 136 Commits
Databases		Databases
Guide		Guide
Tutorial		Tutorial
bin		bin
misc		misc
LICENSE.txt		LICENSE.txt
README.md		README.md
ToxcodanGenome_logo.png		ToxcodanGenome_logo.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ToxCodAn-Genome

Getting Started

Installation

Requirements

Toxin Database

Usage

Inputs

Outputs

Citation

License

Contact

Frequently Asked Questions (FAQ)

About

Releases

Packages

Languages

License

pedronachtigall/ToxCodAn-Genome

Folders and files

Latest commit

History

Repository files navigation

ToxCodAn-Genome

Getting Started

Installation

Requirements

Toxin Database

Usage

Inputs

Outputs

Citation

License

Contact

Frequently Asked Questions (FAQ)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages