Expand genomes results to include assembly and linked objects #148

Merged: 15 commits merged into develop on Jul 22, 2024

Contributor

@bilalebi bilalebi commented Jun 25, 2024

JIRA Ticket

Parent ticket: https://www.ebi.ac.uk/panda/jira/browse/EA-1191

Description

What changes have you made?

  • Expand the query to include assembly and linked objects (assembly, organism, species and regions' data) (EA-1210)
  • Alter the arguments to be more specific (EA-1222)
  • Expand the query to include Datasets info (EA-1226)
    • For genomes query
    • And genome query
  • See if you can optimise the speed
  • ⚠️ Point Thoas to the latest API tag before merging

Example:

How the `genomes` query looked before these changes

Query

query {
  genomes(by_keyword: {keyword: "cat"}) {
    genome_id
    assembly_accession
    scientific_name
    release_number
    taxon_id
  }
}

Response

{
  "data": {
    "genomes": [
      {
        "genome_id": "14fe0744-4b47-4d31-bddf-dc2b6ffc7f4e",
        "assembly_accession": "GCA_000181335.4",
        "scientific_name": "Felis catus",
        "release_number": 110.1,
        "taxon_id": 9685
      }
    ]
  }
}

The New Query

Only one argument is needed; the example below uses common_name.

query {
  genomes(by_keyword: {
    # tolid:""
    # assembly_accession_id:"GCA_902167145.1"
    # assembly_name:"Zm-B73-REFERENCE-NAM-5.0"
    # ensembl_name:"Zm-B73-REFERENCE-NAM-5.0"
    common_name:"maize",
    # scientific_name:"Zea mays"
    # scientific_parlance_name:"Maize"
    # species_taxonomy_id:"4577"
  })
  {
    genome_id
    assembly_accession
    scientific_name
    release_number
    release_date
    taxon_id
    assembly {
      assembly_id
      name
      accession_id
      regions {
        name
        length
        code
        topology
        metadata {
          ontology_terms {
            accession_id
            value
            url
            source {
              name
              url
              description
            }
          }
        }
      }
      organism {
        is_reference_organism
        scientific_name
        scientific_parlance_name
        species {
          taxon_id
        }
      }
    }
    dataset {
      dataset_id
      name
      release
      type
      source
      dataset_type
      version
      release_date
      release_type
    }
  }
}
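The one-argument rule could also be checked on the client before the request is sent. Below is a minimal sketch in Python, reading "only one argument is needed" as "exactly one argument", which is an assumption (the server may well accept more). The argument names are copied from the query comments above; the helper `build_keyword_filter` is hypothetical and not part of Thoas.

```python
# Argument names copied from the query above; the helper itself is hypothetical.
ALLOWED_KEYWORD_ARGS = (
    "tolid",
    "assembly_accession_id",
    "assembly_name",
    "ensembl_name",
    "common_name",
    "scientific_name",
    "scientific_parlance_name",
    "species_taxonomy_id",
)


def build_keyword_filter(**kwargs: str) -> dict:
    """Return a `by_keyword` variable holding exactly one supported argument."""
    unknown = sorted(set(kwargs) - set(ALLOWED_KEYWORD_ARGS))
    if unknown:
        raise ValueError(f"Unsupported argument(s): {', '.join(unknown)}")
    # Empty values (e.g. tolid="") are treated as "not provided".
    provided = {name: value for name, value in kwargs.items() if value}
    if len(provided) != 1:
        raise ValueError(f"Expected exactly one argument, got {len(provided)}")
    return {"by_keyword": provided}


print(build_keyword_filter(common_name="maize"))
# prints {'by_keyword': {'common_name': 'maize'}}
```

The returned dictionary could be passed as GraphQL variables if the query were rewritten to take a variable instead of an inline argument; that rewrite is not shown here.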

Response

{
  "data": {
    "genomes": [
      {
        "genome_id": "4273b9f0-c927-4215-87bf-828ef65de984",
        "assembly_accession": "GCA_902167145.1",
        "scientific_name": "Zea mays",
        "release_number": 110.1,
        "release_date": "2023-10-18",
        "taxon_id": 4577,
        "assembly": {
          "assembly_id": "Zm-B73-REFERENCE-NAM-5.0",
          "name": "Zm-B73-REFERENCE-NAM-5.0",
          "accession_id": "GCA_902167145.1",
          "regions": [
            {
              "name": "8",
              "length": 182411202,
              "code": "chromosome",
              "topology": "linear",
              "metadata": {
                "ontology_terms": [
                  {
                    "accession_id": "SO:0000340",
                    "value": "chromosome",
                    "url": "www.sequenceontology.org/browser/current_release/term/SO:0000340",
                    "source": {
                      "name": "Sequence Ontology",
                      "url": "www.sequenceontology.org",
                      "description": "The Sequence Ontology is a set of terms and relationships used to describe the features and attributes of biological sequence. "
                    }
                  }
                ]
              }
            },
            .. very looooong list
          ],
          "organism": {
            "is_reference_organism": false,
            "scientific_name": "Zea mays",
            "scientific_parlance_name": "Maize",
            "species": {
              "taxon_id": 4577
            }
          }
        },
        "dataset": [
          {
            "dataset_id": "93633b44-0e59-49bf-8902-5acb41c5c74c",
            "name": "GCA_902167145.1",
            "release": 110.1,
            "type": "core_annotation",
            "source": "core",
            "dataset_type": "assembly",
            "version": "",
            "release_date": "10/18/2023",
            "release_type": "partial"
          },
          {
            "dataset_id": "b908f88f-a047-4da0-813e-f0a94b66877a",
            "name": "GCA_902167145.1_EXT01",
            "release": 110.1,
            "type": "core_annotation",
            "source": "core",
            "dataset_type": "genebuild",
            "version": "EXT01",
            "release_date": "10/18/2023",
            "release_type": "partial"
          },
          {
            "dataset_id": "cd0e290e-7986-4383-b545-5483ac3b1316",
            "name": "Zm-B73-REFERENCE-NAM-5.0",
            "release": 110.1,
            "type": "variation_annotation",
            "source": "vcf",
            "dataset_type": "variation",
            "version": "1.0",
            "release_date": "10/18/2023",
            "release_type": "partial"
          },
          {
            "dataset_id": "adf617f3-9e2b-49e0-921f-8bbe2c9ae750",
            "name": "Compara homologies ",
            "release": 110.1,
            "type": "compara_annotation",
            "source": "compara",
            "dataset_type": "homologies",
            "version": "1.0",
            "release_date": "10/18/2023",
            "release_type": "partial"
          },
          {
            "dataset_id": "b78f2b17-03e9-11ef-b152-005056b31e22",
            "name": "Compara homologies ",
            "release": 110.1,
            "type": "compara_annotation",
            "source": "compara",
            "dataset_type": "homologies",
            "version": "2.0",
            "release_date": "10/18/2023",
            "release_type": "partial"
          }
        ]
      }
    ]
  },
  "extensions": {
    "execution_time_in_seconds": 2.33,
    "metadata_api_version": "2.1.0a3"
  }
}
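Note that the response above carries `release_date` in two formats: ISO (`2023-10-18`) on the genome and `10/18/2023` on each dataset. A consumer comparing the two might normalise them first. The sketch below assumes the dataset format is MM/DD/YYYY (the sample value cannot be DD/MM, since 18 is not a valid month); `parse_release_date` is a hypothetical helper, and the data is a trimmed copy of the response.

```python
from datetime import date, datetime

# Trimmed copy of the response above (two of the five datasets).
genome = {
    "release_date": "2023-10-18",
    "dataset": [
        {"name": "GCA_902167145.1", "dataset_type": "assembly", "release_date": "10/18/2023"},
        {"name": "GCA_902167145.1_EXT01", "dataset_type": "genebuild", "release_date": "10/18/2023"},
    ],
}


def parse_release_date(value: str) -> date:
    """Parse either date format seen in the response (ISO or MM/DD/YYYY)."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised release_date: {value!r}")


genome_date = parse_release_date(genome["release_date"])
for dataset in genome["dataset"]:
    # True when the dataset was released on the same day as the genome.
    same_day = parse_release_date(dataset["release_date"]) == genome_date
    print(dataset["dataset_type"], same_day)
```

With the trimmed data above, both datasets report the same release day as the genome.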
Deprecated Examples

Example 1: Expand the query to include assembly and linked objects

Query

query {
  genomes(by_keyword: {keyword: "cat"}) {
    genome_id
    assembly_accession
    scientific_name
    release_number
    taxon_id
    assembly {
      assembly_id
      name
      accession_id
      regions {
        name
        length
        code
        topology
        metadata {
          ontology_terms {
            accession_id
            value
            url
            source {
              name
              url
              description
            }
          }
        }
      }
      organism {
        is_reference_organism
        scientific_name
        scientific_parlance_name
        species {
          taxon_id
        }
      }
    }
  }
}

Response

{
  "data": {
    "genomes": [
      {
        "genome_id": "14fe0744-4b47-4d31-bddf-dc2b6ffc7f4e",
        "assembly_accession": "GCA_000181335.4",
        "scientific_name": "Felis catus",
        "release_number": 110.1,
        "taxon_id": 9685,
        "assembly": {
          "assembly_id": "Felis_catus_9.0",
          "name": "Felis_catus_9.0",
          "accession_id": "GCA_000181335.4",
          "regions": [
            {
              "name": "D2",
              "length": 90186660,
              "code": "chromosome",
              "topology": "linear",
              "metadata": {
                "ontology_terms": [
                  {
                    "accession_id": "SO:0000340",
                    "value": "chromosome",
                    "url": "www.sequenceontology.org/browser/current_release/term/SO:0000340",
                    "source": {
                      "name": "Sequence Ontology",
                      "url": "www.sequenceontology.org",
                      "description": "The Sequence Ontology is a set of terms and relationships used to describe the features and attributes of biological sequence. "
                    }
                  }
                ]
              }
            }
            ...
          ],
          "organism": {
            "is_reference_organism": false,
            "scientific_name": "Felis catus",
            "scientific_parlance_name": "Cat",
            "species": {
              "taxon_id": 9685
            }
          }
        }
      }
    ]
  }
}

Example 2: Alter the query params to be more specific

Query

Only one argument is needed; the example below uses common_name.

query grpc_by_specific_keyword {
  genomes(by_keyword: {
    # tolid:""
    # assembly_accession_id:"GCA_902167145.1"
    # assembly_name:"Zm-B73-REFERENCE-NAM-5.0"
    # ensembl_name:"Zm-B73-REFERENCE-NAM-5.0"
    common_name:"maize",
    # scientific_name:"Zea mays"
    # scientific_parlance_name:"Maize"
    # species_taxonomy_id:"4577"
  })
  {
    genome_id
    assembly_accession
    scientific_name
    release_number
    release_date
    taxon_id
    assembly {
      assembly_id
      name
      accession_id
      regions {
        name
        length
        code
        topology
        metadata {
          ontology_terms {
            accession_id
            value
            url
            source {
              name
              url
              description
            }
          }
        }
      }
      organism {
        is_reference_organism
        scientific_name
        scientific_parlance_name
        species {
          taxon_id
        }
      }
    }
  }
}

Response

{
  "data": {
    "genomes": [
      {
        "genome_id": "4273b9f0-c927-4215-87bf-828ef65de984",
        "assembly_accession": "GCA_902167145.1",
        "scientific_name": "Zea mays",
        "release_number": 110.1,
        "release_date": "2023-10-18",
        "taxon_id": 4577,
        "assembly": {
          "assembly_id": "Zm-B73-REFERENCE-NAM-5.0",
          "name": "Zm-B73-REFERENCE-NAM-5.0",
          "accession_id": "GCA_902167145.1",
          "regions": [
            {
              "name": "8",
              "length": 182411202,
              "code": "chromosome",
              "topology": "linear",
              "metadata": {
                "ontology_terms": [
                  {
                    "accession_id": "SO:0000340",
                    "value": "chromosome",
                    "url": "www.sequenceontology.org/browser/current_release/term/SO:0000340",
                    "source": {
                      "name": "Sequence Ontology",
                      "url": "www.sequenceontology.org",
                      "description": "The Sequence Ontology is a set of terms and relationships used to describe the features and attributes of biological sequence. "
                    }
                  }
                ]
              }
            },
            ...
          ],
          "organism": {
            "is_reference_organism": false,
            "scientific_name": "Zea mays",
            "scientific_parlance_name": "Maize",
            "species": {
              "taxon_id": 4577
            }
          }
        }
      }
    ]
  }
}

I also sneaked in this ticket since Compara wanted to experiment with extracting all genes belonging to a given genome.

Changes

  • Altered resolve_genes() and made symbol in by_symbol optional
  • This will allow Compara to fetch all genes and related data (🔥 RIP Mongo! 🔥 ) for a given genome_uuid
  • The query might look something like this:
query get_protein_checksum {
  genes(by_symbol:{genome_id:"a7335667-93e7-11ec-a39d-005056b38ce3"})
  {
    stable_id
    transcripts {
      stable_id
      metadata {
        canonical {
          value
        }
      }
      product_generating_contexts {
        product {
          sequence {
            checksum
          }
        }
      }
    }
  }
}
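To illustrate what an optional `symbol` means for the lookup, here is a rough sketch in Python. It is not the actual `resolve_genes()` code: `build_gene_filter` and the filter keys are hypothetical, and the gene symbol in the usage lines is only a placeholder value.

```python
from typing import Optional, TypedDict


class SymbolInput(TypedDict, total=False):
    # Mirrors the GraphQL input: genome_id is always sent, symbol may be absent.
    genome_id: str
    symbol: Optional[str]


def build_gene_filter(by_symbol: SymbolInput) -> dict:
    """Hypothetical helper: add the symbol to the filter only when one is given."""
    gene_filter = {"genome_id": by_symbol["genome_id"]}
    symbol = by_symbol.get("symbol")
    if symbol:
        gene_filter["symbol"] = symbol
    return gene_filter


genome_id = "a7335667-93e7-11ec-a39d-005056b38ce3"
# Without a symbol the filter matches every gene of the genome.
print(build_gene_filter({"genome_id": genome_id}))
# With a symbol (placeholder value) the filter narrows to that gene.
print(build_gene_filter({"genome_id": genome_id, "symbol": "BRCA2"}))
```

Fetching every gene of a genome in one request can produce a very large result, so a real implementation would probably want some form of paging or limit; that is outside this sketch.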

@bilalebi bilalebi self-assigned this Jun 25, 2024
@bilalebi bilalebi changed the title Expand metadata Expand genomes results to include assembly and linked objects Jun 25, 2024
Collaborator

@kamaldodiya kamaldodiya left a comment

Looks good to me

@@ -64,6 +64,7 @@ venv/
ENV/
env.bak/
venv.bak/
node_modules/
Collaborator

How did you get those? There isn't a package.json anywhere in the repo.

Contributor Author

I got them while I was playing around with GraphQL federation.


genome(by_genome_uuid: GenomeUUIDInput!): Genome

}

input SymbolInput {
-  symbol: String!
+  symbol: String
Collaborator

@azangru azangru Jul 18, 2024

How can this be optional/nullable if this is specifically an input to search by symbol? How can one search by symbol without providing a symbol?

Contributor Author

This change is temporary: Botond from Compara wants to experiment with fetching all gene data belonging to a given genome.

I figured the quickest way, with minimal changes, was to tweak the existing query (commit b942572).
We will definitely revert the changes later.

Jira Ticket: https://www.ebi.ac.uk/panda/jira/browse/EA-1228

@bilalebi bilalebi merged commit 31697f3 into develop Jul 22, 2024
2 checks passed