A website’s search engine can be one of the most useful tools for helping users retrieve the information they are looking for. Currently, Ensembl’s search tool relies on indexed fields from its databases, which mainly cover key information such as genes, species and proteins, including several synonyms for most of them. Ensembl would like to extend the search capabilities of its new beta website so that users have an even better experience.
We want to expand the Ensembl beta website’s search functionality to support searching by taxonomic information. In particular:
- Find a species already present in Ensembl by searching for its scientific name or a taxonomy (homotypic) synonym
- Search for a taxonomy clade and obtain the list of species available within that clade
- Return close relatives when the previous searches find no matches, i.e. if the search term is present in the taxonomic tree but no matching species are found in Ensembl, the closest available relatives are returned instead
The objective of this project is to provide a standalone Elasticsearch tool that can handle taxonomic-related requests.
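The three behaviours listed above can be pictured as a fallback chain: exact or synonym match first, then clade membership, then a walk up the tree until some ancestor contains an Ensembl species. The following is a minimal in-memory sketch of that idea, not the project's Elasticsearch implementation. The `PARENT`, `SYNONYMS` and `ENSEMBL_SPECIES` structures and the `search` function are hypothetical names, and the tree is hand-written and heavily simplified (intermediate ranks are omitted).

```python
# Hypothetical, simplified child -> parent map (intermediate ranks omitted).
PARENT = {
    "Culex quinquefasciatus": "Culex",
    "Culex maxi": "Culex",
    "Culex": "Culicinae",
    "Aedes aegypti": "Aedes",
    "Aedes": "Culicinae",
    "Culicinae": "Insecta",
    "Bombus terrestris": "Bombus",
    "Bombus": "Insecta",
}
# Synonym -> accepted scientific name (illustrative subset).
SYNONYMS = {"Apis terrestris": "Bombus terrestris"}
# Species assumed to be available in Ensembl for this toy example.
ENSEMBL_SPECIES = {"Culex quinquefasciatus", "Aedes aegypti", "Bombus terrestris"}


def lineage(taxon):
    """Return the taxon followed by its ancestors, up to the root."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def species_in_clade(clade):
    """Return the Ensembl species whose lineage contains the clade."""
    return sorted(s for s in ENSEMBL_SPECIES if clade in lineage(s))


def search(term):
    """Return (kind of match, species) using the three-step fallback."""
    name = SYNONYMS.get(term, term)
    if name in ENSEMBL_SPECIES:
        return "exact", [name]
    # Step 0 checks the term itself as a clade; later steps walk up the tree.
    for depth, ancestor in enumerate(lineage(name)):
        members = species_in_clade(ancestor)
        if members:
            kind = "clade" if depth == 0 else f"relatives via {ancestor}"
            return kind, members
    return "none", []
```

In this sketch, a term that is absent from the toy tree simply yields `("none", [])`; a real implementation would need to decide how to report that case.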
Before you start, make sure Python 3.10 or higher is installed and activated on your system.
1. Install Python dependencies:
       pip3 install -r requirements.txt
2. Define the environment variables in `config/env_vars` (e.g. the connection URL for your Bonsai Elasticsearch cluster), and then activate them:
       export $(xargs < config/env_vars)
3. Run the scripts that download the Ensembl metadata and NCBI taxonomy data into JSON files (as Django fixtures):
       python3 scripts/get_ensembl_metadata.py
       python3 scripts/get_taxon_flat.py
4. Move into the source directory and run the Django model migrations:
       cd src
       python3 manage.py migrate
5. Load the previously downloaded fixtures into the Django models:
       python3 manage.py loaddata ensembl_metadata.json
       python3 manage.py loaddata ncbi_taxon_flat.json
6. Set up the Elasticsearch cluster and update the settings if required. The current app is configured to use Bonsai, as seen in step 2.
7. Index the documents into Elasticsearch:
       python3 manage.py search_index --rebuild
8. Start the Django development server:
       python3 manage.py runserver
9. Open the localhost URL (http://127.0.0.1:8000/) in your browser to access the taxonomy search page.
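The environment-variable step above exports the `KEY=VALUE` pairs stored in `config/env_vars`. As a rough Python counterpart, the sketch below parses such a file and builds a connection setting from it. It assumes the project uses django-elasticsearch-dsl (which provides a `search_index` management command and reads an `ELASTICSEARCH_DSL` setting); the variable name `BONSAI_URL` and both helper functions are hypothetical and not taken from the repository.

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs


def elasticsearch_settings(env):
    """Build an ELASTICSEARCH_DSL-style dict from a mapping of variables.

    BONSAI_URL is an assumed variable name; the fallback URL targets a
    local server and is only a placeholder.
    """
    return {"default": {"hosts": env.get("BONSAI_URL", "http://localhost:9200")}}
```

In a Django settings module this could be used as `ELASTICSEARCH_DSL = elasticsearch_settings(os.environ)` once the variables have been exported.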
- `scientific name`, `synonym` and `equivalent name` (which contains informal synonyms of formal scientific names) will be the primary focus during the initial stage of the project.
- We might also want to consider `misspelling` and `misnomer`, but only in later stages (after the ones above are completed). The same applies to `anamorph` and `teleomorph`, since these two are only applicable to Fungi.
- It seems `acronym` is primarily used for viruses, so it is not applicable to Ensembl for the time being.
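The staging described in the list above amounts to filtering taxonomy name records by their name class before indexing. Here is a minimal sketch of that filter; the record layout (`taxon_id`, `name`, `name_class`), the placeholder taxon ID, the `stage` parameter and the function name are assumptions made for illustration only.

```python
# Name classes grouped as described in the list above.
PRIMARY_CLASSES = {"scientific name", "synonym", "equivalent name"}
DEFERRED_CLASSES = {"misspelling", "misnomer", "anamorph", "teleomorph"}
EXCLUDED_CLASSES = {"acronym"}  # mostly used for viruses


def select_names(records, stage="initial"):
    """Keep only the records whose name class is in scope for the stage.

    stage="initial" keeps the primary classes; any other value also
    keeps the deferred classes. Excluded classes are never kept.
    """
    allowed = set(PRIMARY_CLASSES)
    if stage != "initial":
        allowed |= DEFERRED_CLASSES
    return [r for r in records if r["name_class"] in allowed]


# Hypothetical records; the taxon ID is a placeholder and the
# misspelling and acronym entries are invented for the example.
records = [
    {"taxon_id": 1, "name": "Galendromus occidentalis", "name_class": "scientific name"},
    {"taxon_id": 1, "name": "Metaseiulus occidentalis", "name_class": "synonym"},
    {"taxon_id": 1, "name": "Galendromus occidentallis", "name_class": "misspelling"},
    {"taxon_id": 1, "name": "GO", "name_class": "acronym"},
]
```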
Before you start, make sure Docker is installed and available on your system. The steps below are required only for the initial setup. To restart an existing Docker container, run `docker start my_es_server` and enter the previously saved password if prompted.
1. Create a Docker network for the Elasticsearch server:
       docker network create elastic
2. Start the server and print the password (make a note of it, as it is printed only on the first run!):
       docker run --name my_es_server --net elastic -p 9200:9200 -p 9300:9300 \
         -e "discovery.type=single-node" \
         -t docker.elastic.co/elasticsearch/elasticsearch:8.8.0
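Once the container is running, it can be useful to confirm that the server answers with the saved password. The sketch below uses only the Python standard library to query the cluster health endpoint. It assumes the defaults of the Elasticsearch 8 Docker image (HTTPS on port 9200, a self-signed certificate, and the built-in `elastic` user); the function names are hypothetical, and certificate verification is disabled purely for a local test setup.

```python
import base64
import json
import ssl
import urllib.request


def build_health_request(password, host="https://localhost:9200", user="elastic"):
    """Build a basic-auth request for the cluster health endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{host}/_cluster/health",
        headers={"Authorization": f"Basic {token}"},
    )


def cluster_status(password):
    """Return the cluster status string reported by a local server."""
    # The local container uses a self-signed certificate, so verification
    # is switched off here. Do not do this against a remote cluster.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_health_request(password)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)["status"]
```

Calling `cluster_status("<saved password>")` requires the container to be running; building the request on its own does not touch the network.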
- Get an exact match of existing species (including synonyms):
  - Ixodes scapularis should return Ixodes scapularis, Ixodes scapularis ISE6 and Ixodes scapularis PalLabHiFi
  - Metaseiulus occidentalis should return Galendromus occidentalis
  - Apis terrestris should return Bombus terrestris
- Get the species within a given taxonomy clade:
  - Culicinae should return Aedes aegypti, Aedes albopictus, Culex quinquefasciatus and Culex quinquefasciatus JHB
  - Triatominae should return Rhodnius prolixus
  - Hemichordata should return Saccoglossus kowalevskii
- Get the closest relatives when the search term is not part of Ensembl:
  - Seisonidae should return Adineta vaga (first common ancestor: Rotifera)
  - Cenolia should return Anneisia japonica (first common ancestor: Comatulinae)
  - Culex maxi should return Culex quinquefasciatus and Culex quinquefasciatus JHB (first common ancestor: Culex)
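The expected results listed above lend themselves to a table-driven regression check. The sketch below records them as a mapping from search term to expected species and compares them against any search callable. `failing_terms` and the `search` parameter are hypothetical: `search` stands for whatever function wraps the real query and is assumed to return an iterable of species names.

```python
# Expected results, copied from the examples above.
EXPECTED = {
    "Ixodes scapularis": {
        "Ixodes scapularis",
        "Ixodes scapularis ISE6",
        "Ixodes scapularis PalLabHiFi",
    },
    "Metaseiulus occidentalis": {"Galendromus occidentalis"},
    "Apis terrestris": {"Bombus terrestris"},
    "Culicinae": {
        "Aedes aegypti",
        "Aedes albopictus",
        "Culex quinquefasciatus",
        "Culex quinquefasciatus JHB",
    },
    "Triatominae": {"Rhodnius prolixus"},
    "Hemichordata": {"Saccoglossus kowalevskii"},
    "Seisonidae": {"Adineta vaga"},
    "Cenolia": {"Anneisia japonica"},
    "Culex maxi": {"Culex quinquefasciatus", "Culex quinquefasciatus JHB"},
}


def failing_terms(search, expected=EXPECTED):
    """Return the search terms whose results differ from the expected set."""
    return sorted(
        term for term, wanted in expected.items() if set(search(term)) != wanted
    )
```

An empty list from `failing_terms` means every example behaves as described; these expectations depend on which species Ensembl currently holds, so the table would need updating as the data changes.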
- Increase the virtual memory map limit (on Windows WSL):
      sudo sysctl -w vm.max_map_count=262144
- View an environment variable:
      printenv <env-var-name>
- Open the database shell to query the database behind the Django models:
      python3 manage.py dbshell
- Make Django migrations:
      python3 manage.py makemigrations
- Apply Django migrations:
      python3 manage.py migrate
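Regarding the first command in the list above, the current `vm.max_map_count` value can be checked before changing it. On Linux (including WSL) the kernel exposes it at `/proc/sys/vm/max_map_count`. A minimal sketch follows; the function name is hypothetical and the threshold is the value used in the `sysctl` command above.

```python
from pathlib import Path

REQUIRED_MAP_COUNT = 262144  # value set by the sysctl command above


def max_map_count_ok(path="/proc/sys/vm/max_map_count", required=REQUIRED_MAP_COUNT):
    """Return True if the kernel setting is at least the required value."""
    current = int(Path(path).read_text().strip())
    return current >= required
```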