Google Summer of Code at the Tree of Life at the Wellcome Sanger Institute
Accepting proposals for Google Summer of Code 2024
Natural language search across the tree of life
The Tree of Life at the Wellcome Sanger Institute is generating high-quality genome assemblies as part of the Earth BioGenome Project (EBP), a global initiative to generate reference-quality genome sequences for all species on earth. Given the scale of this initiative, we need ready access to metadata relevant to sample collection, sequencing and assembly and a platform to coordinate our efforts with those of other projects under the EBP umbrella. To meet this need, we have developed Genomes on a Tree (GoaT), an Elasticsearch-based datastore, search engine, and reporting platform, with directly-measured or estimated values for a suite of attributes across all known species.
This project aims to bridge the gap between GoaT's potential to answer queries relevant to all stages of the biodiversity genomics projects within the EBP and users' ability to formulate these queries using the syntax that the existing API, CLI and front-end UI require. The goal is to directly answer questions like:
- Which plant families do not yet have a reference-quality genome assembly for any species? [UI result table]
- How many butterfly species without an assembly have an expected genome size greater than 1 billion base pairs? [API `/count` endpoint]
- Which species on a project target list are already being sequenced by another EBP partner project? [API `/search` endpoint]
- What proportion of reference-quality genome assemblies have been produced by EBP vs non-EBP projects in each of the last 5 years? [UI report view for 1 year]
GoaT is part of a broader collection of tools developed under the GenomeHubs project. A closely related tool, BoaT, indexes data within assemblies, and it is anticipated that development of GoaT-NLP will also benefit BoaT and further GenomeHubs projects still in development. All GenomeHubs source code is open source under the MIT license and available from the GenomeHubs GitHub organisation, primarily in the genomehubs/genomehubs repository. Configuration files to define the source data and customise the UI for GoaT are in the genomehubs/goat-data and genomehubs/goat-ui repositories, respectively.
To support queries like the examples above, GoaT stores directly measured and estimated values for a range of attributes, alongside taxonomic information including rank and lineage, as one document per taxon in the datastore. The data structure for the taxon index is summarised below; other data types, including assembly, sample and features, are stored in separate indexes.
A processed taxon document can be obtained from the `/record` endpoint of the API, e.g. `/api/v2/record?recordId=9612&result=taxon`, or viewed in the UI by visiting the corresponding taxon record page, e.g. `/record?recordId=9612&result=taxon`.
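As a minimal sketch of how these two URLs relate, the helper below builds either form from a record ID. The base URL `https://goat.genomehubs.org` is an assumption taken from the site address mentioned elsewhere in this document, and `record_url` and `fetch_record` are hypothetical helper names, not part of any GoaT client library. The shape of the JSON response is not described here, so `fetch_record` returns it unparsed beyond JSON decoding.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public GoaT instance is served from this host.
BASE_URL = "https://goat.genomehubs.org"


def record_url(record_id, result="taxon", api=True):
    """Build a record URL for the API (api=True) or the UI record page."""
    path = "/api/v2/record" if api else "/record"
    params = urlencode({"recordId": record_id, "result": result})
    return f"{BASE_URL}{path}?{params}"


def fetch_record(record_id, result="taxon"):
    """Fetch a record from the API and decode the JSON body (needs network access)."""
    with urlopen(record_url(record_id, result)) as response:
        return json.load(response)
```

Calling `record_url(9612)` gives the API form of the example above, and `record_url(9612, api=False)` gives the UI form.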
Each document has a core set of keyword fields:
- `taxon_id` - unique taxon ID in the current taxonomy (default NCBI taxonomy)
- `parent` - taxon ID of the parent taxon
- `scientific_name` - scientific name of the taxon
- `taxon_rank` - rank of the taxon, e.g. species, genus, family, etc.
Additional fields are divided into three groups:
A set of nested fields for each name of the taxon:
- `name` - the taxon name
- `class` - the taxon name class, e.g. scientific name, common name, etc.
- `source` - the source of the taxon name, e.g. NCBI, GBIF, etc.
An ordered set of nested fields for each ancestor of the taxon:
- `taxon_id` - the unique ID of the ancestral taxon
- `taxon_rank` - the rank of the ancestral taxon
- `scientific_name` - the scientific name of the ancestral taxon
- `node_depth` - the depth of the ancestral taxon in the taxonomic tree
A set of nested fields for each attribute:
- `key` - the unique attribute name
- `*_value` - the summary value of the attribute, where `*` is the attribute type, which largely corresponds to the list of Elasticsearch field data types
- `source` - the source of the attribute, e.g. NCBI, GBIF, etc.
- `min`, `max`, `mean`, `median` - summary statistics for the attribute
- `aggregation_method` - the aggregation method used to generate a summary value
- `aggregation_source` - the source of the attribute used to generate a summary value (direct, ancestor, descendant)
Each attribute value in the taxon index can be derived from one or more raw values, which are stored as a nested set of values in the `values` field.
The full mapping used is defined in `taxon.json`. Similar mappings are used for the other document types.
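To make the document layout more concrete, here is a rough sketch of a taxon document as a Python dictionary, with a helper that looks up the summary value of an attribute. The group names `taxon_names` and `lineage` follow the field names used in the filter descriptions in this document; the top-level key `attributes`, the attribute name `genome_size`, the `long` type prefix and every numeric value are illustrative assumptions, not real GoaT data. The authoritative structure is the mapping in `taxon.json`.

```python
from typing import Any, Optional

# Hypothetical taxon document. All attribute values are placeholders.
example_doc = {
    "taxon_id": "9612",
    "parent": "9611",
    "scientific_name": "Canis lupus",
    "taxon_rank": "species",
    "taxon_names": [
        {"name": "Canis lupus", "class": "scientific name", "source": "NCBI"},
    ],
    "lineage": [
        {
            "taxon_id": "9611",
            "taxon_rank": "genus",
            "scientific_name": "Canis",
            "node_depth": 1,  # assumption: depth counted from the taxon itself
        },
    ],
    "attributes": [  # assumption: name of the nested attribute group
        {
            "key": "genome_size",
            "long_value": 1000,  # summary value, stored as <type>_value
            "min": 900,
            "max": 1100,
            "mean": 1000,
            "median": 1000,
            "aggregation_method": "median",
            "aggregation_source": "direct",
            "values": [{"long_value": 900}, {"long_value": 1100}],
        },
    ],
}


def summary_value(doc: dict, key: str) -> Optional[Any]:
    """Return the *_value field of the attribute named key, or None if absent."""
    for attribute in doc.get("attributes", []):
        if attribute.get("key") != key:
            continue
        for field, value in attribute.items():
            # The summary value lives in a field named after the attribute type.
            if field.endswith("_value"):
                return value
    return None
```

Because the type prefix varies by attribute, the helper searches for any field ending in `_value` rather than hard-coding one type.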
The query syntax currently used by GoaT is tied to this structured data model. It supports simple and highly specific queries, but takes time to learn and presents a barrier to wider data access.
GoaT query syntax allows any combination of `tax_` filters and `<attribute> <operator> <value>` clauses to be joined with `AND` operators.
`tax_` filters are used to restrict the taxonomic scope of a query as follows:
- `tax_name(<value>)` - return results where any taxon name or ID at the top-level or in `taxon_names` matches `value`
- `tax_tree(<value>)` - return results where the name or taxon ID of any taxon in the `lineage` matches `value`
- `tax_rank(<value>)` - return results where the `taxon_rank` matches `value`
- `tax_depth(<value>)` - return results where the `node_depth` of any taxon in the `lineage` < `value`
- `tax_lineage(<value>)` - return results for each ancestral taxon in the `lineage` of a record where any taxon name or ID at the top-level or in `taxon_names` matches `value`
The operators supported are: `=`, `!=`, `>`, `>=`, `<`, and `<=`. A full list of available attribute names, types and value constraints is available at goat.genomehubs.org/types.
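The rules above can be sketched as a small query builder that composes `tax_` filters and attribute clauses and joins them with `AND`. This is an illustration only: `tax_filter`, `clause` and `build_query` are hypothetical helper names, and the attribute name `genome_size` used in the usage note is an assumption that should be checked against the list of types.

```python
# Operators and tax_ filter kinds as listed in this document.
OPERATORS = {"=", "!=", ">", ">=", "<", "<="}
TAX_FILTERS = {"name", "tree", "rank", "depth", "lineage"}


def tax_filter(kind: str, value: object) -> str:
    """Return a tax_ filter such as tax_tree(Lepidoptera)."""
    if kind not in TAX_FILTERS:
        raise ValueError(f"unknown tax_ filter: {kind!r}")
    return f"tax_{kind}({value})"


def clause(attribute: str, operator: str, value: object) -> str:
    """Return an <attribute><operator><value> clause, rejecting unknown operators."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"{attribute}{operator}{value}"


def build_query(*parts: str) -> str:
    """Join filters and clauses with AND, the only joining operator described."""
    if not parts:
        raise ValueError("a query needs at least one filter or clause")
    return " AND ".join(parts)
```

For example, `build_query(tax_filter("tree", "Lepidoptera"), tax_filter("rank", "species"), clause("genome_size", ">", 1000000000))` yields `tax_tree(Lepidoptera) AND tax_rank(species) AND genome_size>1000000000`, which covers part of the butterfly question above (it does not express the "without an assembly" condition).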
Support for logical `OR` operators is currently limited to the ability to provide a comma-separated list of values for an attribute or `tax_` filter, in which case results will be returned if at least one value matches the query. For example, `tax_tree(fungi,metazoa) AND long_list(DTOL,GAGA)` will return results for taxa in either the fungi or metazoa lineage and where the `long_list` attribute contains either `DTOL` or `GAGA`.
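The "at least one value matches" behaviour can be illustrated with a short parser for terms written in the `name(value1,value2)` form used in the example above. This is a sketch of the semantics, not of how GoaT evaluates queries internally: `parse_term`, `split_query` and `term_matches` are hypothetical names, and the case-insensitive comparison is an assumption made for the illustration.

```python
import re

# Matches terms written as name(value1,value2,...).
TERM = re.compile(r"^(\w+)\((.*)\)$")


def parse_term(term: str) -> tuple:
    """Split a term such as tax_tree(fungi,metazoa) into its name and value list."""
    match = TERM.match(term.strip())
    if match is None:
        raise ValueError(f"not a name(values) term: {term!r}")
    name, raw_values = match.groups()
    values = [v.strip() for v in raw_values.split(",") if v.strip()]
    return name, values


def split_query(query: str) -> list:
    """Split a query on AND and parse each term."""
    return [parse_term(part) for part in query.split(" AND ")]


def term_matches(values: list, record_values: list) -> bool:
    """OR semantics: true if at least one listed value is present in the record."""
    wanted = {v.lower() for v in values}  # assumption: case-insensitive match
    return any(v.lower() in wanted for v in record_values)
```

A record satisfies the whole query only if every `AND`-joined term matches, while each term needs just one of its comma-separated values to match.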
Values of summary statistics can also be queried using the `min`, `max`, `mean`, `median` modifiers, e.g. using `min(assembly_date)>=2023-01-01` to find taxa by the earliest assembly date.
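One way to check that a generated clause fits this syntax, for instance when validating the output of a natural language model, is to parse it back into its parts. The sketch below assumes attribute names consist of word characters only and does not check names or values against the list of types; `Clause` and `parse_clause` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Clause(NamedTuple):
    modifier: Optional[str]  # min, max, mean, median, or None
    attribute: str
    operator: str
    value: str


# Two-character operators are listed first so that ">=" is not read as ">".
CLAUSE = re.compile(
    r"^(?:(min|max|mean|median)\((\w+)\)|(\w+))"
    r"\s*(!=|>=|<=|=|>|<)\s*"
    r"(.+)$"
)


def parse_clause(text: str) -> Clause:
    """Parse an attribute clause, with or without a summary statistic modifier."""
    match = CLAUSE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid attribute clause: {text!r}")
    modifier, wrapped, bare, operator, value = match.groups()
    # Exactly one of wrapped/bare is set, depending on whether a modifier was used.
    return Clause(modifier, wrapped or bare, operator, value.strip())
```

Parsing `min(assembly_date)>=2023-01-01` gives the modifier `min`, the attribute `assembly_date`, the operator `>=` and the value `2023-01-01`; a clause without a modifier has `modifier` set to `None`.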
GoaT-NLP aims to extend the capabilities of GoaT to support natural language queries. The project aims to:
- Take natural language queries and convert them to structured queries using the GoaT query syntax.
- Automatically select the most appropriate type of search to perform and return results as a natural language statement.
- Augment GoaT search results with extracts from unstructured text.
- Extract information from text using machine learning models for indexing.
We are open to suggestions for further directions in which to develop this project. Validation will be as important as information retrieval, to ensure that the results presented accurately reflect the intended query.
We are proposing the GoaT-NLP project as a Google Summer of Code project for 2024. If you are interested in contributing to GoaT-NLP, please read the information provided in the ToL+PaM GSoC 2024 Google Doc and use the information in that document to get in touch with any questions you may have.
We will assess applications from potential GSoC contributors on the basis of the proposal. Again, see the ToL+PaM GSoC 2024 Google Doc for more, but broadly, we want to know:
- how would you approach this project?
- which technologies would you use and why?
- what would be the key milestones and when would you reach them?
- how would you ensure the sustainability of your code beyond the end of the GSoC term?
You should follow the GSoC contributor guidelines to help structure your proposal. Note that we'd like to see a diagram of your suggested implementation. While we have no fixed length limit, we value the ability to identify and focus on the core elements of your proposal and to write concisely.
- GoaT paper
- GoaT website
- API documentation
- GenomeHubs codebase
- BoaT website
- Tree of Life
- Earth BioGenome Project
- Google Summer of Code
- ToL+PaM GSoC 2024 Google Doc
GoaT was part of Biodiversity Genomics Academy 2023 (BGA23); watch the video tutorial or view the slides.