Skip to content

anton-shikov/cry_processor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

86 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build Status python3.7 image License: GPL v3

CryProcessor

CryProcessor is a high-troughtput tool for the Cry toxins mining from the fasta-files or directly from the illumina reads.

About CryProcessor

CryProcessor is a python-written tool for searching and extracting Cry toxins from illumina sequence data or from the protein fasta files. It includes several parts: an hmm-based scanning for potential Cry toxins, obtaining information about the domains, extracting Cry toxins with 3 domains only and comparing found toxins with Bt nomenclature.

The mode for performing the toxins search directly from the illumina reads implies building an assembly graph (using SPAdes) and the subsequent mining toxins directly from the obtained assebmly graph.

CryProcessor Pipeline

The following text stands for the full pipeline description (for the illumina reads). To start, SPAdes (http://cab.spbu.ru/software/spades/) or metaSPAdes (http://cab.spbu.ru/software/meta-spades/) are implemented to get the assembly graph from the fastq-files. After that, the potential Cry toxins (with at least 30% identity to the hmm-consensus) are extracted from the assembly paths via PathRacer (http://cab.spbu.ru/software/pathracer/). Then hmmsearch (http://hmmer.org/) is used to find Cry toxin domains in the obtained sequences. In the next step, the results of hmmsearch are combined to get the toxins that posses all three domains.

The coordinates of the domains are used to cut flanking sequences and save the domains with the corresponding linkers. The full sequences (without processing procedure) are used to compare the obtained toxins with the Bt nomenclature database via diamond blastp (https://github.com/bbuchfink/diamond). The non-identical sequences are extracted and marked as the potentially new toxins.

For all the found sequences (both identical to presented in Bt nomenclature and the novel sequences) an online ipg-annotation (Identical Protein Group) is performed (to see the annotation output read the annotation output section below). Finally, nucleotide sequences, corresponding to the protein sequences of the found toxins, are downloaded. Metadata will be uploaded only if the accession numbers are present in the query.

Installation and Usage

Prerequisites

  • python (3.7 or higher);
  • Biopython (1.66 or higher).

To install Biopython use the following command:

~$ pip install biopython

Installing

To install CryProcessor clone git repository to your PC:

~$ git clone https://github.com/lab7arriam/cry_processor

After downloading, CryProcessor is ready to use. You can also add CryProcessor to the PATH typing:

~$ PATH=$PATH:/path/to/install

If you want to use prebuild Docker container with CryProcessor, pull it using the following command:

~$ docker pull lab7arriam/cry_processor

Running and Using Tool

Quick Usage

To extract Cry toxins from the protein fasta file simply execute the following command:

~$  /path/to/install/cry_processor.py -fi input.faa  -od output_dir

This command will automatically search for the Cry toxins in the fasta file with amino acid sequences.

If you use Docker container, you should use following command:

~$ docker run --rm -v /path/to/data:/data lab7arriam/cry_processor cry_processor -fi /data/<input file> -od /data/<output_dir> [args]

Supported Input Formats

  • fasta files with protein sequences;
  • gfa files (representing genome assebly graph);
  • forward and reverse illumina reads.

Tool Options:

The full list of tool options:

-fi <input.fasta> or <input.gfa> an input file in the fasta format of in the gfa format
-hm <path to hmmer directory> path to the hmmer directory if you want to use local hmmer
-pr <1 or 2> the processig type: 1 for extracting all the domains, 2 for extrating 2-3 domains only (default 1)
-th the nubmer of threads for hmmer/SPAdes/PathRacer (default 8)
-ma <e-mail> an e-mail address for the connecting to NCBI
-od <output dir> the output directory
-r <do or fd> the running mode: do - the domain only search only; fd - the full toxins mining with the subsequent domain search
-a (--annotate) perform the data anotation with ipg (only if the accession numbers are present)
-nu <fn or pn or an> upload the nucleotide sequences: fn - the full sequences, pn - the processed sequences, an - the both variants
-pa (--pathracer) - launching PathRacer on the gfa file
-fo <input_1.fastq> - forward illumina reads
-re <input_2.fastq> - reverse illumina reads
-n (--meta) - the flag for specifying the metagenomic mode forSPAdes
-k <number> - the K-mer size for PathRacer (default 21)
-s (--silent) - disable the console output
-f --force - forced directory overwriting

Using Tool for the Fasta Files

To use the tool for the files in the fasta format execute the command, presented in quick usage, you can also specify the annotation (writing an e-mail address is strongly recommended):

~$ /path/to/install/cry_processor.py -fi input.faa  -od output_dir -ma <e-mail address> --annotate

Use the -nu flag to download nucleotide sequences:

~$ /path/to/install/cry_processor.py  -fi input.faa  -od output_dir -ma <e-mail address> --annotate -nu pn

The pipeline of searching could be performed in two modes:

  • do - the domain only mode. Searches the Cry toxin domains from the full queiry, then combines this data to extract the toxins with 3 domains;
  • fd - the find domains mode. At the begining, hmmsearch with the full Cry toxin model is performed. Next, the domains are extracted and combined. This mode is quicker, as far as the domain search is performed on previously extracted sequences.

Using Tool for the .gfa Files

You can apply the Cry toxins search directly from the assembly graph in the gfa format with the following commad:

~$ /path/to/install/cry_processor.py -fi input.gfa  -od output_dir --path_racer

Using Tool for the Illumina Reads

This mode includes the reads assembly with SPAdes and the subsequent hmm-based toxins mining. To implement this use the following command:

~$ /path/to/install/cry_processor.py  -fo forward_reads.fastq -re reverse_reads.fastq -od output_dir 

If you want to search for the Cry toxins from metagenomic reads specify --meta flag:

~$ /path/to/install/cry_processor.py  -fo forward_reads.fastq -re reverse_reads.fastq -od output_dir --meta

Tips for the Correct Running

Note that you cannot mix modes:

  • Do not use the --pathracer flag with the illumina fastq files query flags (-fo and -re);
  • Do not mix the -fi agrument with -fo and -re arguments;
  • Do not mix the --pathracer flag and the --meta flag;
  • Do not specify the --meta mode with the -fi agrument;
  • You should use both the -fo and -re argumens together;
Using the -nu flag is possible only if the --annotate flag is specified.
Note that performing the anotation is not recommended for the gfa and assembly modes, because the online annotation is impossible without the accession numbers in the query.

The Annotation Output

Using the --annotate flag will perform the NCBI-search in the ipg database for the submitted accession numbers within the query and return gathered information in tsv-format with the following structure:

protein_id initial_description top_cry_hit cry_identity source nucl_accession start stop strand ipg_prot_id ipg_prot_name organism strain assembly
AFU17015.1 AFU17015.1 pesticidal crystal proteinBacillus thuringiensis MC28 Cry4Cc1 99.4 INSDC CP003690.1 58993 62628 + AFU17015.1 pesticidal crystal protein Bacillus thuringiensis MC28 MC28 GCA_000300475.1

Columns description:

  • protein_id - initial sequence id in the query
  • initial_description - protein description in the query
  • top_cry_hit - Cry protein from Bt nomenclature with the best identity score according to diamond
  • cry_identity - identity score with the closest Cry protein from Bt nomenclature
  • source - database source of the sequence
  • nucl_accession - accession number for the nucleotide sequence
  • start - starting position in the full nucleotide sequence
  • stop - ending position in the full nucleotide sequence
  • strand - DNA strand orientation for the nucleotide sequence
  • ipg_prot_id - protein id according to the ipg database
  • ipg_prot_name - protein description to the ipg database
  • organism - taxon description
  • strain - strain of the organism
  • assembly - assembly ids where the nucleotide sequence is present
Rows possesing the first four coloums refer to the initial sequences (which are found in the query), results for all the identical proteins are marked with --.
Here's an expamle of such rows:
-- -- -- -- RefSeq NZ_CP010111.1 142760 146194 - WP_080989235.1 pesticidal protein Bacillus thuringiensis serovar indiana HD521 GCF_001183785.1
Note that all the rows, marked with -- in the first four columns, are identical to the initial sequenses in the queiry with the full columns, located above.

The Output Files Structure

In the output directory, specified with the -od flag, the following subdirectories with the following files could be generated:

  • assembly - this directory is created if the assembly mode is enabled. It contains the files and directories that refer to the SPAdes output, for instanse, the assembly graph with the contigs in the gfa format. To see the full list of the SPAdes output look at the SPAdes manual (http://cab.spbu.ru/software/spades/);
  • pathracer - this directory includes the PathRacer output. It will be created if the pathracer mode or the assembly mode are enabled. To see the full list of the PathRacer output read the manual (http://cab.spbu.ru/software/pathracer/);
  • full_toxins -this directory is created if the fd mode is enabled. It contains the following files:
    • <input>_full_extracted.sto - the result of hmmsearch with the full-toxin model in the sto format;
    • <input>_full_extracted.fasta - the result of hmmsearch with the full-toxin model in the fasta format;
  • domains. This directory contains the results of hmmsearch for all the domain models in the sto and fasta formats:
    • <input>_D1_extracted.sto;
    • <input>_D1_extracted.fasta;
    • <input>_D2_extracted.sto;
    • <input>_D2_extracted.fasta;
    • <input>_D3_extracted.sto;
    • <input>_D3_extracted.fasta;
  • logs. This directory contains the log files for the different stages of the pipeline:
    • spades.log - the log file for the assembly process;
    • pathracer.log - the log file for the hmm-based search for the Cry toxins from the gfa file;
    • full_extraction.log - the log file for performing hmmsearch on the full-toxin model;
    • domains_extraction.log - the log file for performing hmmsearch on the domain models;
    • coordinate_matches_<input>.txt - the dictionary of the domain mappings for the query;
    • diamond.log - the log file for performing diamond blastp agaist the Bt nomenclature database;
    • aligned_<input>.fa - the aligned sequences from diamond blastp;
    • unaligned_<input>.fa - the unligned sequences from diamond blastp;
    • diamond_matches_<input>.txt - the results of diamond blastp in the table form;
    • cry_processor.log - the main CryProcessor logfile;
  • first_search_<input>.fasta - the result of hmmsearch with the full model on the query in the fasta format. This file is later used as a queiry for the domains search;
  • raw_full_<input>.fasta. This file contains the full protein sequences of the Cry toxins possesing all three domains;
  • raw_processed_<input>.fasta -this file contains the processed protein sequences (without left and right lanking parts) of the Cry toxins possesing all three domains;
  • proteins_domain_mapping_full_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the uprocessed protein sequences;
  • proteins_domain_mapping_processed_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the processed protein sequences;
  • unique_<input>.fasta - this file contains the potentially new Cry toxins, differing from Bt nnomenclature afrer diamond blastp;
  • annotation_table_<input>.tsv - the result of the ipg annotation of the found toxins (the structure of the file is described above);
  • <input>_full_nucl.fna - the full nucleotide sequences for the found toxins;
  • <input>_processed_nucl.fna - the processed nucleotide sequences for the found toxins;
  • nucl_domain_mapping_full_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the uprocessed nucleotide sequences;
  • nucl_domain_mapping_processed_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the processed nucleotide sequences;

Note, that in the files, marked with the raw prefix the initial accession numbers from the query are used as ids, while in the files, obtained after diamond blastp (<input>_full_nucl.fna, <input>processed_nucl.fna and unique<input>.fasta) the id structure is modified in the following way:

  • Cry53Aa1(40.7)_WP_103591149.1
  • <top Cry protein hit from Bt nomenclature>(<the identity score with this hit>)_<the initial accession number>

References

  1. Bankevich A., Nurk S., Antipov D., Gurevich A., Dvorkin M., Kulikov A. S., Lesin V., Nikolenko S., Pham S., Prjibelski A., Pyshkin A., Sirotkin A., Vyahhi N., Tesler G., Alekseyev M. A., Pevzner P. A. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 2012
  2. Buchfink B., Xie C. & Huson D.H. Fast and Sensitive Protein Alignment using DIAMOND. Nature Methods, 2015
  3. Finn R.D., Clements J., Eddy S.R. HMMER Web Server: Interactive Sequence Similarity Searching. Nucleic Acids Research, 2011.
  4. Nurk S., Meleshko D., Korobeynikov A., Pevzner P. A. metaSPAdes: a new versatile de novo metagenomics assembler. Genome Research, 2017

Author

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published