Reference sequence databases
dickgroenenberg edited this page Feb 18, 2020 · 12 revisions
GenBank is large and keeps growing (https://www.ncbi.nlm.nih.gov/genbank/statistics/). Because of the number of sequences, a BLAST search against the full reference takes a long time. In most cases, not all of these sequences are needed to identify amplicon data. With the following commands you can make sub-selections of GenBank.
Download nt database
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
Download and create a taxonomy mapping file
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
sed '1d' nucl_gb.accession2taxid | awk '{print $2" "$3}' > accession_taxonid
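The sed/awk step above drops the header line and keeps the second and third columns (accession.version and taxid), separated by a space. A minimal Python sketch of the same transformation is shown here for readers who prefer to script it; the function name and the sample rows are illustrative and not part of the original workflow.

```python
import io

def write_accession_taxid(src, dst):
    """Copy the accession.version and taxid columns, skipping the header."""
    next(src, None)  # drop the header row, like sed '1d'
    for line in src:
        fields = line.split()
        if len(fields) >= 3:
            # columns 2 and 3, space separated, like awk '{print $2" "$3}'
            dst.write(f"{fields[1]} {fields[2]}\n")

# Illustrative rows in the four-column layout of nucl_gb.accession2taxid
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "A00001\tA00001.1\t10641\t58418\n"
    "A00002\tA00002.1\t9913\t2\n"
)
out = io.StringIO()
write_accession_taxid(sample, out)
print(out.getvalue(), end="")
```

For the real file, the same function could be called with two file handles opened on nucl_gb.accession2taxid and accession_taxonid.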
Create sub-selections of the nt database
utilities/filter_nt.py
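How utilities/filter_nt.py selects sequences is not described on this page. As a rough illustration only, the sketch below assumes that a sub-selection is made by keeping the FASTA records whose header mentions a marker name. The function name, the keyword list, and the sample records are hypothetical and do not describe the actual script.

```python
import io

def filter_fasta(src, dst, keywords):
    """Write only the records whose header contains one of the keywords."""
    lowered = [k.lower() for k in keywords]
    keep = False
    kept = 0
    for line in src:
        if line.startswith(">"):
            # decide once per record, based on the header line
            header = line.lower()
            keep = any(k in header for k in lowered)
            if keep:
                kept += 1
        if keep:
            dst.write(line)
    return kept

# Hypothetical records in nt-style FASTA
nt = io.StringIO(
    ">AB000001.1 Example species cytochrome c oxidase subunit I (COI) gene\n"
    "ACGTACGT\n"
    "ACGT\n"
    ">AB000002.1 Example species 18S ribosomal RNA gene\n"
    "GGCCGGCC\n"
)
co1 = io.StringIO()
n_kept = filter_fasta(nt, co1, ["COI", "CO1", "cox1"])
print(n_kept)
print(co1.getvalue(), end="")
```

Plain substring matching is crude and can pick up unrelated headers, so the keyword list would need checking against real nt headers before use.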
Now that the selections are made, the FASTA files need to be indexed as BLAST databases:
sudo makeblastdb2.8.0 -in CO1.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in 12S.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in ITS.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in 16S.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
sudo makeblastdb2.8.0 -in matk.fa -dbtype nucl -taxid_map accession_taxonid -parse_seqids
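The five indexing commands differ only in the input file, so they can also be generated in a loop. The sketch below only builds and prints the commands. It reuses the binary name and flags from the commands above, leaves out sudo (add it back if your installation needs it), and the helper name is made up for illustration.

```python
import shlex

MARKERS = ["CO1", "12S", "ITS", "16S", "matk"]

def build_makeblastdb_commands(markers, taxid_map="accession_taxonid"):
    """Return one makeblastdb argument list per marker FASTA file."""
    return [
        ["makeblastdb2.8.0", "-in", f"{marker}.fa", "-dbtype", "nucl",
         "-taxid_map", taxid_map, "-parse_seqids"]
        for marker in markers
    ]

commands = build_makeblastdb_commands(MARKERS)
for cmd in commands:
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```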
The steps below make a selection of complete or partial genome sequences. Marker genes can be present in bacteria, but they are not always annotated, so a marker-based selection can miss them. This genome selection can help with detecting bacteria in samples.
Get taxidlineage file
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip
unzip -j new_taxdump.zip "taxidlineage.dmp"
Get the accessions of bacterial sequences (NCBI taxon ID 2 is Bacteria)
python get_accession_of_taxonid.py -t taxidlineage.dmp -i 2 -a nucl_gb.accession2taxid -o bacterial_accessions
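The internals of get_accession_of_taxonid.py are not shown on this page. The sketch below is one plausible way such a selection could work, assuming that taxidlineage.dmp lists each taxon ID followed by its space-separated ancestor IDs (fields delimited by `|`), and that the accession.version column is the one wanted in the output. The function names and the sample rows are illustrative.

```python
import io

def taxids_under(lineage_file, root):
    """Return root plus every taxid that has root among its ancestors."""
    selected = {root}
    for line in lineage_file:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 2 or not parts[0]:
            continue
        if root in parts[1].split():  # exact match on whole taxon IDs
            selected.add(parts[0])
    return selected

def accessions_for(acc2taxid_file, taxids):
    """Yield accession.version for rows whose taxid is in taxids."""
    next(acc2taxid_file, None)  # skip the header line
    for line in acc2taxid_file:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in taxids:
            yield fields[1]

# Shortened, illustrative lineages (not real taxonomy records)
lineage = io.StringIO(
    "2\t|\t131567 \t|\n"
    "1224\t|\t131567 2 \t|\n"
    "9606\t|\t131567 2759 33208 \t|\n"
)
acc2taxid = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "X00001\tX00001.1\t1224\t1\n"
    "X00002\tX00002.1\t9606\t2\n"
    "X00003\tX00003.1\t2\t3\n"
)
bacterial_taxids = taxids_under(lineage, "2")
bacterial = list(accessions_for(acc2taxid, bacterial_taxids))
print(bacterial)
```

Holding the selected taxon IDs in a set keeps the pass over the large accession2taxid file to a single streaming read.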