This repo contains data to set up the MongoDB database for Genome Nexus, as well as scripts to generate new data. When setting up a database container for Genome Nexus, it is not required to generate new data: data for several reference genomes and Ensembl releases is available in the `data` folder.
The parent image is Bitnami's mongodb image.
Current installed MongoDB version: 4.0.12
There is a Mongo Docker container that has all the data imported. You can use the Docker Compose file in the Genome Nexus repo itself to start both the web app and the database.
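As a minimal sketch of that workflow (the repo URL and Compose setup are assumptions; consult the Genome Nexus repository for the authoritative instructions):

```sh
# Sketch only: repo URL and Compose file location are assumptions;
# see the Genome Nexus repository for the authoritative setup.
git clone https://github.com/genome-nexus/genome-nexus.git
cd genome-nexus
docker compose up -d   # starts the web app and the MongoDB database together
```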
Run the script `scripts/import_mongo.sh` to import processed data files into a running database. When running this script, please specify:

- `MONGO_URI`: Mongo database address, for example `mongodb://127.0.0.1:27017/annotator`.
- `REF_ENSEMBL_VERSION`: reference genome and Ensembl release, for example `grch37_ensembl92` or `grch38_ensembl92`. Files are imported from `data/<refgenome_ensemblversion>/export/`.
Example:

```sh
MONGO_URI="mongodb://127.0.0.1:27017/annotator" \
REF_ENSEMBL_VERSION="grch37_ensembl92" \
./scripts/import_mongo.sh
```
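To spot-check that the import worked, you can list the collections in the target database (this is just a sanity check; the exact collection names depend on the release):

```sh
# A successful import should show the Genome Nexus collections
# rather than an empty list.
mongo mongodb://127.0.0.1:27017/annotator --eval 'db.getCollectionNames()'
```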
This repository contains a pipeline to retrieve data for a specified reference genome and Ensembl build. Generated data is saved in `data/`. To generate data for a different reference genome and Ensembl release, follow the instructions below.
The main driver of the data loading pipeline is the Makefile found in `data/`. It downloads the relevant tables from Ensembl BioMart, Pfam, and HGNC, and transforms them into the proper format for MongoDB.
The Makefile will create and fill the directories:

- `data/<refgenome_ensemblversion>/input`: input tables retrieved from Ensembl BioMart.
- `data/<refgenome_ensemblversion>/export`: pipeline output, used by MongoDB.
- `data/<refgenome_ensemblversion>/tmp`: temporary files.
The `input` and `export` folders are tracked by Git, while the `tmp` folder contains the intermediate files and is not tracked by Git.
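For example, after a run for `grch37_ensembl92` the layout looks roughly like this:

```sh
tree -L 1 data/grch37_ensembl92
# data/grch37_ensembl92
# ├── export   # MongoDB import files (tracked by Git)
# ├── input    # BioMart input tables (tracked by Git)
# └── tmp      # intermediate files (not tracked by Git)
```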
This pipeline has a few Python and R dependencies. The Python dependencies can be installed from the file `requirements.txt`:
```sh
cd scripts
pip install -r requirements.txt
```
For R there is only a dependency on the biomaRt library:

```sh
R -e "source('https://bioconductor.org/biocLite.R'); biocLite('biomaRt')"
```
Run the import pipeline using the command below. This will take a few hours to complete.

```sh
cd data
make all \
    VERSION=grch37_ensembl92 \
    GFF3_URL=ftp://ftp.ensembl.org/pub/grch37/release-92/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz
```
To build the data files for a different reference genome, change the `VERSION` and `GFF3_URL` variables accordingly (examples are in the Makefile).
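For instance, a GRCh38 run might look like the following (the exact GFF3 URL here is an assumption; check the Makefile examples for the path matching your Ensembl release):

```sh
# Illustrative GRCh38 invocation; verify the GFF3 URL against the Makefile examples.
make all \
    VERSION=grch38_ensembl92 \
    GFF3_URL=ftp://ftp.ensembl.org/pub/release-92/gff3/homo_sapiens/Homo_sapiens.GRCh38.92.gff3.gz
```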
If the pipeline crashes, for example when the Ensembl REST API is down, sometimes an empty file is created. To continue the pipeline, remove the empty file and run `make all` again.
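A quick way to locate such leftovers (path shown for `grch37_ensembl92`; adjust to your version):

```sh
# List empty files left behind by a crashed run, then remove them:
find data/grch37_ensembl92 -type f -empty -print
find data/grch37_ensembl92 -type f -empty -delete
```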
Additionally, mouse data can be processed to build a database for mouse; this is documented separately.
During this process, every transcript in `data/<refgenome_ensemblversion>/input/ensembl_biomart_geneids.txt` is assessed to be either canonical or not by querying the Ensembl REST API. This takes a while, because a maximum of 1000 transcripts can be queried at a time. Progress can be viewed by inspecting the temporary files created in `data/<refgenome_ensemblversion>/tmp/transcript_info`. The gene source file `ensembl_biomart_geneids.txt` contains about 224,596 transcripts, so the pipeline will save about 225 of these files.
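To track progress, you can count the chunk files written so far (path assumes `grch37_ensembl92`):

```sh
# Count the transcript_info chunk files produced so far (~225 expected in total):
ls data/grch37_ensembl92/tmp/transcript_info | wc -l
```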
When the REST API is slow for whatever reason, the server can return a timeout error. When that happens, the `QSIZE` parameter can be used to reduce the query size (e.g. to 100 transcripts at a time):

```sh
make all \
    VERSION=grch37_ensembl92 \
    GFF3_URL=ftp://ftp.ensembl.org/pub/grch37/release-92/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.gff3.gz \
    QSIZE=100
```
To verify the pipeline produced data for the correct reference genome, you can check the exon coordinates in `export/ensembl_biomart_transcripts.json.gz`. Select an Ensembl exon ID, query it on Ensembl GRCh38 or GRCh37, select the gene, select the transcript, and select 'Exons'. This will display all the exons of the transcript and their genomic coordinates.
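For a quick look at one record (assuming the export stores one JSON document per line; this just pretty-prints the first one):

```sh
# Pretty-print the first transcript record to spot-check exon coordinates:
zcat data/grch37_ensembl92/export/ensembl_biomart_transcripts.json.gz | head -n 1 | python -m json.tool
```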
When new data has been created, create a PR to Genome-Nexus to add this data to the master branch.