Reference sequence databases

dickgroenenberg edited this page Feb 18, 2020 · 12 revisions

GenBank

GenBank is large and keeps growing (https://www.ncbi.nlm.nih.gov/genbank/statistics/). Because of the number of sequences, a BLAST search against the full database takes a long time. In most cases not all sequences are needed to identify amplicon data. With the following commands you can make sub-selections of GenBank.

Download nt database

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz

Download and create a taxonomy mapping file

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
sed '1d' nucl_gb.accession2taxid | awk '{print $2" "$3}' > accession_taxonid
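
The sed/awk step above drops the header line and keeps the second and third columns of the NCBI file. As a minimal sketch, the same transformation in Python could look like this, assuming the usual tab-separated layout with the columns accession, accession.version, taxid and gi. The function name `write_accession_taxid` is made up for this example.

```python
def write_accession_taxid(src, dst):
    """Write 'accession.version taxid' pairs from an accession2taxid file."""
    with open(src) as fin, open(dst, "w") as fout:
        next(fin, None)  # skip the header line (same as sed '1d')
        for line in fin:
            fields = line.split()
            if len(fields) < 3:
                continue  # ignore blank or truncated lines
            # column 2 = accession.version, column 3 = taxid
            fout.write(f"{fields[1]} {fields[2]}\n")

# Example call (file names as used on this page):
# write_accession_taxid("nucl_gb.accession2taxid", "accession_taxonid")
```

Each output line then has the form `accession.version taxid`, which is the two-column layout passed to `-taxid_map` further down.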

Create sub-selections of the nt database

utilities/filter_nt.py
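
The script itself is not reproduced on this page. As a rough illustration of the idea only (this is not the actual filter_nt.py), a marker sub-selection could be made by keeping the FASTA records whose header mentions a marker name. The function name and the keyword list are assumptions for this sketch.

```python
import re

def filter_fasta(lines, keywords):
    """Yield the lines of FASTA records whose header contains any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    keep = False
    for line in lines:
        if line.startswith(">"):
            # decide once per record, based on the header line
            keep = bool(pattern.search(line))
        if keep:
            yield line

# Example call (hypothetical keywords, streaming so nt is never fully in memory):
# with open("nt") as fin, open("CO1.fa", "w") as fout:
#     fout.writelines(filter_fasta(fin, ["COI", "CO1", "cytochrome oxidase subunit 1"]))
```

Plain keyword matching will produce false positives and misses, so the keyword list needs checking for each marker.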

Now that the selections are made, the FASTA files need to be indexed

sudo makeblastdb2.8.0 -in CO1.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in 12S.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in ITS.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in 16S.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in matk.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
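
The five commands above differ only in the input file, so they can also be run from a loop. A minimal sketch, assuming the binary is installed under the same name as above (`makeblastdb2.8.0`) and can be run without sudo. The function names are made up for this example.

```python
import shutil
import subprocess

MARKERS = ["CO1", "12S", "ITS", "16S", "matk"]

def makeblastdb_command(fasta, taxid_map="accession_taxonid", binary="makeblastdb2.8.0"):
    """Build the argument list for one makeblastdb call, same flags as above."""
    return [binary, "-in", fasta, "-dbtype", "nucl",
            "-taxid_map", taxid_map, "-parse_seqids"]

def index_all(markers=MARKERS, binary="makeblastdb2.8.0"):
    """Index <marker>.fa for every marker; stops at the first failing call."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    for marker in markers:
        subprocess.run(makeblastdb_command(f"{marker}.fa", binary=binary), check=True)

# Example call:
# index_all()
```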

Bacterial genomes

This makes a selection of complete or partial bacterial genome sequences. Marker genes are present in bacterial genomes but are not always annotated, so a selection based on marker names can miss them. This selection can help with detecting bacteria in samples.

Get taxidlineage file

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip
unzip -j new_taxdump.zip "taxidlineage.dmp"

Get the accessions of bacterial sequences

python get_accession_of_taxonid.py -t taxidlineage.dmp -i 2 -a nucl_gb.accession2taxid -o bacterial_accessions
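
Here `-i 2` is the NCBI taxon id of Bacteria. The script is not shown on this page; the sketch below only illustrates the general approach and is not the actual get_accession_of_taxonid.py. It assumes that taxidlineage.dmp has `|`-separated fields (taxid, then the space-separated lineage of ancestor taxids) and that the accession2taxid file has the columns accession, accession.version, taxid and gi. Whether the real script writes accession or accession.version is not known here.

```python
def taxids_under(lineage_path, root_taxid):
    """Return the set of taxids (as strings) whose lineage contains root_taxid."""
    root = str(root_taxid)
    selected = {root}  # the root itself also counts
    with open(lineage_path) as fh:
        for line in fh:
            parts = [p.strip() for p in line.split("|")]
            if len(parts) < 2:
                continue  # skip malformed lines
            taxid, lineage = parts[0], parts[1].split()
            if root in lineage:
                selected.add(taxid)
    return selected

def accessions_for(accession2taxid_path, taxids):
    """Yield accession.version for every row whose taxid is in taxids."""
    with open(accession2taxid_path) as fh:
        next(fh, None)  # skip the header line
        for line in fh:
            fields = line.split()
            if len(fields) >= 3 and fields[2] in taxids:
                yield fields[1]

# Example call (file names as used on this page):
# bacteria = taxids_under("taxidlineage.dmp", 2)
# with open("bacterial_accessions", "w") as out:
#     for acc in accessions_for("nucl_gb.accession2taxid", bacteria):
#         out.write(acc + "\n")
```

The accession file is streamed line by line, so only the set of selected taxids has to fit in memory.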