Reference sequence databases: Genbank

dickgroenenberg edited this page Mar 3, 2020 · 37 revisions

Pre-formatted blast database

Download the pre-formatted Genbank nt BLAST database with update_blastdb.pl, then unpack the downloaded archives.

#version 4
ncbi-blast-2.8.1+/bin/update_blastdb.pl nt --passive
#version 5
ncbi-blast-2.8.1+/bin/update_blastdb.pl nt_v5 --blastdb_version 5 --passive
#unpack
for i in *.gz; do tar -xzvf "$i" && rm "$i"; done
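
The unpack step can also be written more defensively. The sketch below is not part of the repository; `unpack_archives` is a hypothetical helper that assumes the downloads are `.tar.gz` files in a single directory. It deletes an archive only after tar has extracted it successfully and stops at the first archive that fails.

```shell
# Hypothetical helper: unpack every .tar.gz archive in a directory and
# delete each archive only after it was extracted successfully.
unpack_archives() {
    dir="${1:-.}"
    for archive in "$dir"/*.tar.gz; do
        # If the glob matched nothing it is passed through literally; skip it.
        [ -e "$archive" ] || continue
        if tar -xzf "$archive" -C "$dir"; then
            rm -- "$archive"
        else
            echo "failed to unpack $archive, keeping it" >&2
            return 1
        fi
    done
}
```

Called as `unpack_archives .` in the download directory, it leaves a failed archive in place so that it can be downloaded again.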

To add taxonomy to the BLAST results, the scripts need the taxonomy reference files merged.dmp and rankedlineage.dmp, which are described at the end of this page.

Sub-selections

Genbank is large and keeps growing (https://www.ncbi.nlm.nih.gov/genbank/statistics/). Because of the number of sequences, a BLAST search against the full reference takes a long time, and identifying amplicon data usually does not require all of them. With the Snakefile (utilities/genbank/Snakefile) you can create smaller sub-selections.
Caution
Please note that sub-selections are based on sequence headers. Sequences that are part of mitochondrial or chloroplast genomes will therefore not be present in these sub-selections.

First create a conda environment

conda env create -f utilities/snakemake37_environment.yml

Go to the utilities folder of genbank

cd galaxy-tool-BLAST/utilities/genbank

Activate the environment

conda activate snakemake37

To create the databases, run the Snakefile. Note that this pipeline produces more than 350GB of output.

snakemake -j 6
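
Because the pipeline writes more than 350GB, it may be worth checking the free disk space before starting the run. The sketch below is an assumption, not part of the repository: `has_free_space` is a hypothetical helper that reads the available space from `df` and compares it with a required number of gigabytes.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEEDED_GB gigabytes available.
has_free_space() {
    needed_gb="$1"
    dir="${2:-.}"
    # df -Pk prints POSIX-format output in 1024-byte blocks; field 4 of the
    # second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    needed_kb=$((needed_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$needed_kb" ]
}
```

For example, `has_free_space 350 . && snakemake -j 6` starts the pipeline only when enough space is reported. The 350 is taken from the output size mentioned above; intermediate files may need more.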

When the snakemake pipeline is done, the output folder contains one folder for each sub-selection. You can move these folders to a destination of your choice. To use a BLAST database in Galaxy, the path of the database needs to be added to the blastn.xml file. See the example below.

<macro name="local_databases">
    <param name="database" type="select" multiple="true" label="Database">
        <option value="/home/galaxy/Tools/galaxy-tool-BLAST/utilities/silva/output/SILVA/18S.fa" label="18S">18S Genbank</option>
    </param>
</macro>
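
After editing blastn.xml, a quick check can catch mistyped database paths. The sketch below assumes one `<option>` per line, as in the example above, and assumes that each `value` is the prefix of the database files (for example `18S.fa.nhr`). Both function names are hypothetical.

```shell
# Hypothetical helper: print the value="..." attribute of every <option>
# line in the given XML file (assumes one <option> per line).
list_database_paths() {
    sed -n 's/.*<option[^>]*value="\([^"]*\)".*/\1/p' "$1"
}

# Hypothetical helper: print every listed path for which no file named
# <path>.* exists on disk.
report_missing_databases() {
    list_database_paths "$1" | while IFS= read -r db; do
        # ls succeeds if at least one file with this prefix exists.
        ls "$db".* >/dev/null 2>&1 || echo "missing: $db"
    done
}
```

Running `report_missing_databases blastn.xml` prints nothing when every listed database prefix has matching files.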

This Galaxy BLAST tool can add taxonomy information to the BLAST hits. For the Genbank references, the files merged.dmp and rankedlineage.dmp are needed. The Snakefile downloads these files and places them in the output/taxonomy folder. You can move them to a location of your choice. The path of that location needs to be added to the blastn.sh file. See the example below.

$SCRIPTDIR"/blastn_add_taxonomy.py" -i $outlocation'/files/' -t /home/galaxy/Tools/galaxy-tool-BLAST/utilities/genbank/output/taxonomy/rankedlineage.dmp -m /home/galaxy/Tools/galaxy-tool-BLAST/utilities/genbank/output/taxonomy/merged.dmp -ts "${9}" -taxonomy_db $outlocation"/taxonomy_db2" -bold_db $outlocation"/bold_db"
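
Before pointing blastn.sh at a taxonomy folder, it can help to confirm that both files are actually there. The sketch below is an assumption, not part of the repository: `check_taxonomy_dir` is a hypothetical helper that reports any of the two files that is missing or empty.

```shell
# Hypothetical helper: verify that rankedlineage.dmp and merged.dmp exist
# and are not empty in the given directory.
check_taxonomy_dir() {
    dir="$1"
    status=0
    for name in rankedlineage.dmp merged.dmp; do
        # -s is true only for a file that exists and has a size above zero.
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the paths used in the example above, this would be called as `check_taxonomy_dir /home/galaxy/Tools/galaxy-tool-BLAST/utilities/genbank/output/taxonomy`.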