Reference sequence databases

GenBank

GenBank is large and keeps growing (https://www.ncbi.nlm.nih.gov/genbank/statistics/). Because of the number of sequences, running BLAST against the full reference takes a long time. In most cases not all sequences are needed to identify amplicon data. With the following commands you can make sub-selections of GenBank.

Download the nt database

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz

Download and create a taxonomy mapping file

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
sed '1d' nucl_gb.accession2taxid | awk '{print $2" "$3}' > accession_taxonid
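The sed/awk step above drops the header line and keeps only the second and third columns (accession.version and taxon ID), separated by a space. As a minimal sketch of the same transformation in Python, assuming the whitespace-separated four-column layout of nucl_gb.accession2taxid; the function name and the sample rows are made up for illustration:

```python
def accession_taxid_pairs(lines):
    """Yield 'accession.version taxid' strings, skipping the header line."""
    rows = iter(lines)
    next(rows, None)  # drop the header, like sed '1d'
    for row in rows:
        fields = row.split()
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        # column 2 is accession.version, column 3 is the taxon ID
        yield f"{fields[1]} {fields[2]}"


# Illustrative rows only; these are not real GenBank records.
sample = [
    "accession\taccession.version\ttaxid\tgi\n",
    "AB000001\tAB000001.1\t9606\t111\n",
    "AB000002\tAB000002.1\t562\t222\n",
]
pairs = list(accession_taxid_pairs(sample))
print("\n".join(pairs))
```

For the real file, pass an open file handle instead of the sample list and write the yielded lines to accession_taxonid.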

Create sub-selections of the nt database

utilities/filter_nt.py
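The exact logic of utilities/filter_nt.py is not described here. As a rough sketch of how such a sub-selection could work, the example below keeps the FASTA records whose header mentions a marker name. The patterns, function names, and sample records are assumptions for illustration and may differ from what the script actually does.

```python
import io
import re

# Hypothetical header patterns; the real filter_nt.py may use other terms.
MARKER_PATTERNS = {
    "CO1": re.compile(
        r"\b(COI|CO1|COX1|cytochrome c oxidase subunit I|"
        r"cytochrome oxidase subunit 1)\b",
        re.IGNORECASE,
    ),
    "12S": re.compile(r"\b12S\b", re.IGNORECASE),
}


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def select_marker(handle, marker):
    """Return the records whose header matches the pattern for `marker`."""
    pattern = MARKER_PATTERNS[marker]
    return [(h, s) for h, s in read_fasta(handle) if pattern.search(h)]


# Made-up records, not real GenBank entries.
SAMPLE_FASTA = (
    ">AB000001.1 Example species cytochrome c oxidase subunit I (COI) gene\n"
    "ACGT\nTTGA\n"
    ">AB000002.1 Example species 12S ribosomal RNA gene\n"
    "GGCC\n"
    ">AB000003.1 Example species cytochrome c oxidase subunit II gene\n"
    "AAAA\n"
)
co1_records = select_marker(io.StringIO(SAMPLE_FASTA), "CO1")
rrna_12s_records = select_marker(io.StringIO(SAMPLE_FASTA), "12S")
```

Matching on whole words keeps "subunit II" out of the CO1 selection. Header text in nt is free-form, so any keyword list will miss some records and include some false hits.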

Now that the selections are made, the FASTA files need to be indexed as BLAST databases

sudo makeblastdb2.8.0 -in CO1.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in 12S.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in ITS.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in 16S.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in matk.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
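The five commands above differ only in the input file, so they can be generated from a list of marker names. The sketch below builds the same argument lists in Python. The helper name is hypothetical, sudo is left out (prepend it if your installation needs it), and the actual run is commented out so the snippet works without BLAST installed.

```python
import shlex

MARKERS = ["CO1", "12S", "ITS", "16S", "matk"]


def makeblastdb_command(fasta, taxid_map="accession_taxonid",
                        binary="makeblastdb2.8.0"):
    """Build the makeblastdb argument list used on this page for one FASTA file."""
    return [binary, "-in", fasta, "-dbtype", "nucl",
            "-taxid_map", taxid_map, "-parse_seqids"]


commands = [makeblastdb_command(f"{marker}.fa") for marker in MARKERS]
for cmd in commands:
    print(shlex.join(cmd))
    # To actually index the file, import subprocess and run:
    # subprocess.run(cmd, check=True)
```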

Bacterial genomes

This makes a selection of complete or partial genome sequences. Marker genes can be present in bacteria but are not always annotated, so this selection can help with detecting bacteria in samples.

Get the taxidlineage file

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip
unzip -j new_taxdump.zip "taxidlineage.dmp"

Get the accessions of bacterial sequences (NCBI taxon ID 2 is Bacteria)

python get_accession_of_taxonid.py -t taxidlineage.dmp -i 2 -a nucl_gb.accession2taxid -o bacterial_accessions
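The internals of get_accession_of_taxonid.py are not shown here. One way such a selection could work is sketched below: collect every taxon ID whose lineage contains the requested ID, then keep the accessions mapped to those IDs. This assumes that each line of taxidlineage.dmp holds a taxon ID and a space-separated list of ancestor IDs, with fields delimited by "|". The function names are hypothetical, and the sample lineages are shortened for illustration.

```python
def taxids_under(lineage_lines, root_taxid):
    """Return the set of taxon IDs that have root_taxid in their lineage."""
    root = str(root_taxid)
    selected = {root}  # include the root itself
    for line in lineage_lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        taxid, lineage = fields[0], fields[1].split()
        if root in lineage:
            selected.add(taxid)
    return selected


def accessions_for(acc2taxid_lines, taxids):
    """Yield accession.version for rows whose taxon ID is in taxids."""
    rows = iter(acc2taxid_lines)
    next(rows, None)  # skip the header line
    for row in rows:
        fields = row.split()
        if len(fields) >= 3 and fields[2] in taxids:
            yield fields[1]


# Shortened, illustrative lineages (not copied from the real dump).
lineage_sample = [
    "1224\t|\t131567 2 \t|\n",
    "562\t|\t131567 2 1224 1236 91347 543 561 \t|\n",
    "9606\t|\t131567 2759 33154 \t|\n",
]
# Made-up accession rows.
acc_sample = [
    "accession\taccession.version\ttaxid\tgi\n",
    "AB000001\tAB000001.1\t562\t111\n",
    "AB000002\tAB000002.1\t9606\t222\n",
]
bacterial_taxids = taxids_under(lineage_sample, 2)
bacterial_accessions = list(accessions_for(acc_sample, bacterial_taxids))
```

Keeping the taxon IDs in a set makes each lookup constant time, which matters because the accession mapping file has a very large number of rows.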
