for entity-based record structures (BioThings APIs), "reverse" operations cannot retrieve the same information as "forward" operations #316

Open
andrewsu opened this issue Oct 8, 2021 · 21 comments

Comments

@andrewsu
Member

andrewsu commented Oct 8, 2021

Tentatively labeling this a bug, but it may be an inherent limitation.

This query

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n2"
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "NCBIGene:2475"
                    ],
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "ids": [
                        "MONDO:0003406"
                    ],
                    "categories": [
                        "biolink:Disease",
                        "biolink:PhenotypicFeature"
                    ]
                }
            }
        }
    }
}

produces this result:
image

But when I simply flip the subject and object, the result has more edge provenance:

image
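For anyone reproducing the comparison, here is a minimal sketch of how the flipped query could be generated from the one above. The helper name `flip_edge` is made up for illustration and is not part of BTE; the node definitions are left untouched and only the edge direction changes.

```python
import copy


def flip_edge(trapi_query: dict, edge_id: str) -> dict:
    """Return a copy of the query with subject and object swapped on one edge."""
    flipped = copy.deepcopy(trapi_query)
    edge = flipped["message"]["query_graph"]["edges"][edge_id]
    edge["subject"], edge["object"] = edge["object"], edge["subject"]
    return flipped


# The query from this issue, written compactly.
query = {
    "message": {
        "query_graph": {
            "edges": {"e01": {"subject": "n0", "object": "n2"}},
            "nodes": {
                "n0": {"ids": ["NCBIGene:2475"], "categories": ["biolink:Gene"]},
                "n2": {
                    "ids": ["MONDO:0003406"],
                    "categories": ["biolink:Disease", "biolink:PhenotypicFeature"],
                },
            },
        }
    }
}

reversed_query = flip_edge(query, "e01")
```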

Is there some inherent limitation in the SmartAPI annotation that explains why this asymmetry has to exist?

@andrewsu added the "bug" (Something isn't working) label Oct 8, 2021
@colleenXu
Collaborator

colleenXu commented Oct 9, 2021

@andrewsu This is two parts:

  • how BTE "picks" the direction to query in (when there's potentially a "better" choice in direction)
  • a limitation of how the data is structured in the API (from my conversations with Kevin). This is a known issue to me, and I've wondered if there are solutions from a biothings suite (querying) or BTE api-response-transform ("filtering" post-query) side...

An explanation of the second point:

For the core biothings APIs, the data is organized by entity so MyDisease.info is organized by Disease. When querying from Disease -> Gene, we can look up everything under that disease's disgenet.genes_related_to_disease section, which includes all of the information in the second screenshot.

However, when we want to query from Gene -> Disease, we need to match the gene ID, which identifies one specific record under the disgenet.genes_related_to_disease section. Because the data is structured by disease, the query retrieves everything under that section, not just the specific record that has that gene ID.

For example, POST this query starting with the Gene NDUFA1 (4694) to https://mydisease.info/v1/query?fields=disgenet.xrefs,disgenet.genes_related_to_disease:

{
    "q": "4694",
    "scopes": "disgenet.genes_related_to_disease.gene_id"
}

The response includes diseases where ONE of their objects matches the query, but it includes ALL of the genes related to those diseases rather than only the objects that have the matching gene...
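A rough sketch of the client-side filtering that would be needed, assuming each hit has the shape implied by the field names above (`disgenet.genes_related_to_disease` holding one object or a list of objects, each with a `gene_id`). The sample hit, its `_id`, and the `score` values are made up for illustration; `keep_matching_genes` is a hypothetical helper, not something BTE or BioThings provides.

```python
def keep_matching_genes(hit: dict, gene_id: str) -> dict:
    """Return a copy of the hit that keeps only the sub-records for one gene."""
    disgenet = hit.get("disgenet", {})
    records = disgenet.get("genes_related_to_disease", [])
    if isinstance(records, dict):
        # A single sub-record may come back as an object rather than a list.
        records = [records]
    matching = [r for r in records if str(r.get("gene_id")) == str(gene_id)]
    return {**hit, "disgenet": {**disgenet, "genes_related_to_disease": matching}}


# Made-up hit: one disease document that lists two genes.
hit = {
    "_id": "MONDO:EXAMPLE",
    "disgenet": {
        "genes_related_to_disease": [
            {"gene_id": 4694, "score": 0.3},
            {"gene_id": 7157, "score": 0.5},
        ]
    },
}

filtered = keep_matching_genes(hit, "4694")
```

Here `filtered` keeps only the gene 4694 record, while the original `hit` is left unchanged.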

I hit a similar problem when trying to make more specific queries to map to more specific biolink predicates (like marker/mechanism under MyDisease's CTD Disease-Chemical information). I describe another example in the notes here. Because I get all the objects under the disease back rather than the matching objects only, I cannot make more specific queries...

@colleenXu
Collaborator

colleenXu commented Dec 3, 2021

Returning to this: this is an inherent limitation from how these records are structured (and indexed and retrieved - the querying process). Going to propose closing this unless we plan to address it...

After discussion with Andrew 12/6, we decided to keep this open as a non-critical thing....to discuss + maybe work on when there is time...

@tokebe
Member

tokebe commented May 4, 2022

If this issue can be addressed through api_response_transform or elsewhere in records handling, we might now be in a better position to address this?

@colleenXu
Collaborator

@tokebe The last time I talked about it with Andrew, it seemed kinda hard...

I think this is a limitation imposed by the document-structure / biothings querying ability itself.

  • MyDisease is organized by disease. So querying with a disease will pull up an entire document as a hit (like its relationships to genes), and we want everything in that document
  • However, if we query using a gene (example: https://mydisease.info/v1/query?q=disgenet.genes_related_to_disease.gene_id:7157&fields=disgenet), we pull up the entire disease's document (its relationships to all other genes). There isn't a clear way to get just the part of the document that refers to that specific gene
  • In Slack, I was told that there isn't a good way to "get just this part of the document" using the query language alone; post-processing is needed.
  • Perhaps post-processing with an api-response-transform module can pull out the parts of documents but...
    • there'd need to be some awareness of what operations are "reversed" and need this post-processing
    • the "part of the document" we want to pull out might differ between operations / apis....since it depends on how the document is structured
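To make the two sub-points concrete, here is one possible shape for such post-processing: a lookup table that records which operations are "reversed" and, for each, where the sub-records live and which field holds the input ID. This is only a sketch; the operation keys are hypothetical labels, and the paths are taken from the query scopes mentioned in this thread.

```python
import copy

# Hypothetical registry: operation label -> (path to the sub-records, field with the input ID).
REVERSE_FILTERS = {
    "mydisease:gene->disease": ("disgenet.genes_related_to_disease", "gene_id"),
    "mygene:disease->gene": ("clingen.clinical_validity", "mondo"),
}


def filter_reverse_hit(hit: dict, operation: str, input_id: str) -> dict:
    """Keep only the sub-records that mention input_id, for known reverse operations."""
    rule = REVERSE_FILTERS.get(operation)
    if rule is None:
        return hit  # not a registered reverse operation: keep the whole document
    path, id_field = rule
    result = copy.deepcopy(hit)
    *parents, leaf = path.split(".")
    node = result
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # the path is absent in this document
    records = node.get(leaf)
    if records is None:
        return result
    if isinstance(records, dict):
        records = [records]  # single sub-record stored as an object
    node[leaf] = [r for r in records if str(r.get(id_field)) == str(input_id)]
    return result


# Made-up document for illustration.
doc = {"_id": "MONDO:EXAMPLE", "disgenet": {"genes_related_to_disease": [
    {"gene_id": 7157, "score": 0.5}, {"gene_id": 4694, "score": 0.3}]}}
trimmed = filter_reverse_hit(doc, "mydisease:gene->disease", "7157")
untouched = filter_reverse_hit(doc, "some-forward-operation", "7157")
```

The per-API differences end up in the registry rather than in the filtering code, although nested cases (where the ID sits more than one level below the record to keep) would need a richer rule than a single path and field.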

@colleenXu
Collaborator

This isn't an issue for "association-based" APIs, AKA where the structure is "one document per association" and all the info on the association is kept in a separate part of the document from the entity IDs.

As soon as a document has parts (for example, multiple associations in one document, where each document represents one of the entity IDs), this problem happens.
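To illustrate the difference between the two structures, here is a small sketch that flattens one entity-based document into "one document per association" records. The input shape follows the MyDisease field names used earlier in this thread; the sample values and the output layout (`gene_id`, `disease_id`, `association`) are assumptions made for the example.

```python
def as_list(value):
    """Normalize None / single object / list into a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def to_association_docs(disease_doc: dict) -> list:
    """Split one entity-based disease document into per-association documents."""
    disease_id = disease_doc["_id"]
    records = as_list(disease_doc.get("disgenet", {}).get("genes_related_to_disease"))
    return [
        {
            "gene_id": str(r["gene_id"]),
            "disease_id": disease_id,
            # Everything about the association sits apart from the entity IDs.
            "association": {k: v for k, v in r.items() if k != "gene_id"},
        }
        for r in records
    ]


def lookup(docs: list, field: str, value: str) -> list:
    """Find association documents by either entity ID."""
    return [d for d in docs if d[field] == value]


# Made-up entity-based document.
disease_doc = {"_id": "MONDO:EXAMPLE", "disgenet": {"genes_related_to_disease": [
    {"gene_id": 4694, "score": 0.3}, {"gene_id": 7157, "score": 0.5}]}}

docs = to_association_docs(disease_doc)
forward = lookup(docs, "disease_id", "MONDO:EXAMPLE")  # Disease -> Gene
reverse = lookup(docs, "gene_id", "4694")              # Gene -> Disease
```

With this layout, a lookup by gene returns exactly the association record that a lookup by disease would return for the same pair, so the direction no longer matters.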

@tokebe
Member

tokebe commented May 9, 2022

Hmm...this seems like it should be possible with post-processing in the transformer, but I agree that this would have to be basically on a per-API basis. We'd have to write new transformers for this, so it makes sense this should remain non-critical until we have more bandwidth.

@ericz1803
Contributor

@tokebe Can you explain how it could be done with post-processing in the transformer? I was looking into this issue a bit and it seems like when querying from Gene->Disease, the disgenet score/information is missing so there would need to be other queries done to retrieve this information again.

Also, would it be possible/practical to have something that says that mydisease should always be queried starting from Disease?

@tokebe
Member

tokebe commented Aug 23, 2022

I was under the impression that the issue is that querying Gene->Disease returns the whole document, and that we currently don't have the logic to pull the disgenet score/information out of it? This would be in the untransformedHits prior to going through the transformer.

If this isn't the case, then yes, we'd have to come up with some other method of retrieving the additional information. I'm not sure exactly how practical it might be to specifically query mydisease Disease-first always, though it might be relatively doable with a custom query builder. This would still require a custom transformer, however, and some additional logic to ensure records are created in the correct direction.

The preference would definitely be to post-process untransformedHits in a new transformer over custom querying logic, if possible.

@ericz1803
Contributor

I did some more investigation and it doesn't grab the disgenet score/information at all when querying from Gene->Disease (the params pulled from the x-bte are completely different). I think what @colleenXu is saying above is that this is a limitation of how the data is structured. So if we were to take the post-processing route, we would have to make a whole other query to retrieve the disgenet score/info document, process that, then reincorporate it into the results.

Below are the query configs and the resulting unTransformedHits:
Screen Shot 2022-08-23 at 3 59 25 PM
Screen Shot 2022-08-23 at 3 38 58 PM

@tokebe
Member

tokebe commented Aug 24, 2022

I suppose this makes the query-direction route more viable -- we'd need a separate query builder for mydisease that checks the subject/object semantic type and queries in reverse appropriately. It would have to somehow tag this such that the record is constructed in reverse of the query where appropriate as well.

Perhaps a reverseAfterQuery value that can be attached to the query_info object, which is passed to the transformer to instruct it to construct the record in reverse (anywhere else, such as reversing after the record is built, would cause issues with directionality down the line).
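A minimal sketch of that idea, written in Python for brevity rather than in BTE's own code. The `QueryInfo` class, the `reverse_after_query` field, and `build_record` are all hypothetical stand-ins for the query_info object and the transformer step; none of them reflect the actual BTE implementation.

```python
from dataclasses import dataclass


@dataclass
class QueryInfo:
    """Hypothetical stand-in for the query_info object passed to the transformer."""
    input_id: str                      # ID that was sent to the API
    reverse_after_query: bool = False  # set by the query builder when it flipped the direction


def build_record(query_info: QueryInfo, output_id: str, attributes: dict) -> dict:
    """Build a record, swapping subject and object if the query was sent in reverse."""
    subject, obj = query_info.input_id, output_id
    if query_info.reverse_after_query:
        # The API was queried Disease -> Gene, but the user asked Gene -> Disease.
        subject, obj = obj, subject
    return {"subject": subject, "object": obj, "attributes": attributes}


# The API is queried disease-first; the record is built to match the user's direction.
info = QueryInfo(input_id="MONDO:0003406", reverse_after_query=True)
record = build_record(info, "NCBIGene:2475", {"source": "disgenet"})
```

Doing the swap at record-construction time keeps every later step working with a record whose direction already matches the user's query edge.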

@colleenXu
Collaborator

Err....I was out when this convo started but perhaps some more explanation / my perspective can help.

I am in agreement with Jackson's points here, and that working with the unTransformedHits is better.

But to develop that code, one will have to mutate the smartapi specs or work with a custom version of the smartapi yaml where the fields are specified differently (to retrieve all the info available in forward querying, during a reverse query).

Notice the query I give in my original post. This query doesn't have the same "fields" specified as the query in the x-bte annotation right now, because we don't have the features to correctly process it (it would just be extra data to send over the internet / ignore while processing).

@colleenXu
Collaborator

Noting a related old discussion (internal lab Slack link): besides the "reverse" issue here, there's an issue of not being able to get a subset of the response. This is a problem when we want to treat those subsets differently (ex: assigning different biolink predicates or edge-attributes for the TRAPI response).

Also pasted below:

colleenxu
Oct 14 2021
is there a way to query and only get the part of the document that matches back? For example, I can query https://mydisease.info/v1/query?q=disgenet.variants_related_to_disease.source:CLINVAR&fields=disgenet.variants_related_to_disease,disgenet.xrefs but it'll return ALL the variants for the disease when maybe only 1-2 of those variants actually match source:CLINVAR...

Jerry
Oct 15 2021
colleenxu
not supported through biothings but if you can directly query es, you can use https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html

colleenxu
Oct 15 2021
I don't think I'll do direct es...but I don't understand how highlighting would do what I specified above. This sounds a bit closer to what I wanted to do...https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html#post-filter ....

Jerry
Oct 15 2021
Highlighting shows where the matches are, you do have to further process the result to use it. Post filter applies to the case when you use both search and aggregation, it's not what you're looking for. :) (edited)

colleenxu
Oct 15 2021
I see now, further processing would be needed to "filter out" stuff that didn't have the highlighting...

@colleenXu changed the title from 'edge provenance for mydisease.info differs depending on direction' to 'for entity-based record structures (BioThings APIs), "reverse" operations cannot retrieve the same information as "forward" operations' on Jan 31, 2023
@andrewsu added the "enhancement" (New feature or request) label and removed the "bug" (Something isn't working) label May 3, 2023
@colleenXu
Collaborator

Not clear if post-processing improvements (JQ-related, using biothings apis query abilities) will help overcome the issues here. Linking to #656, #489 and #521

@rjawesome
Contributor

rjawesome commented Sep 29, 2023

JQ might be able to help.
For this example, the following wrap filter could be used: .disgenet.variants_related_to_disease |= list_filter_any(["source:CLINVAR"])
(since there is only one filter being used, list_filter_all and list_filter_any would do the same thing)

Oct 14 2021
is there a way to query and only get the part of the document that matches back? For example, I can query https://mydisease.info/v1/query?q=disgenet.variants_related_to_disease.source:CLINVAR&fields=disgenet.variants_related_to_disease,disgenet.xrefs but it'll return ALL the variants for the disease when maybe only 1-2 of those variants actually match source:CLINVAR...

@colleenXu
Collaborator

Noting that we previously decided list_filter could not be used to address the "reverses" issue: see the internal Slack discussion starting here. It's hard to paste the whole convo here, but I may do it later...

@colleenXu
Collaborator

colleenXu commented Mar 18, 2024

Update

We still have issues with not being able to retrieve all the information on the association in "reverse" direction.

I was able to get MyChem aeolus count info to show in the reverse direction, by doing a non-batch query and using jmespath (only show the part of the json object that matches the starting ID).

I was also able to get MyChem Chembl treats reference info to show in the reverse direction (see commit).


But I wasn't able to use the same method to get the MyChem chembl drug-mechanism clinicaltrial info to show (drugMechChemblEnsembl-rev, drugMechChemblUniprot-rev):

  • When I try to use jmespath on chembl.drug_mechanisms.target_components.uniprot, I get a 500 error (GET query version).
    • But that query probably wouldn't do what I want anyways: it may only remove the chembl.drug_mechanisms.target_components section, when I want the entire chembl.drug_mechanisms section removed if chembl.drug_mechanisms.target_components.uniprot doesn't match the starting ID
  • If I just add the field to the response + response-mapping, BTE will include all the drug-mech reference info for that chemical, rather than just that specific chem-gene association
    • however, I haven't found an example where this is a problem. It'd be a problem when:
      • there's > 1 drug-mech object for a chemical
      • the refs for those two objects are different and specific to that chemical-gene-relationship
    • For example, this has a pubmed ref for the Kappa opioid receptor but not the Mu opioid receptor. But that pubmed's abstract actually refers to the mu opioid receptor...so it'd probably be fine to include it.
  • UPDATE 2024-03-26: discussed with Andrew. He says that he doesn't want to depend on this assumption that it's fine to mix together all the drug-mech reference info for a chem, when some may not apply to that specific chem-gene pair. So we'll treat this as a "not doable right now"
POST query version

# The URL is double-quoted so the single quotes around the JMESPath literal reach the server;
# inside a single-quoted URL the shell would strip them.
curl --location --globoff "https://mychem.info/v1/query?size=1000&fields=chembl.molecule_chembl_id,chembl.drug_mechanisms&jmespath=chembl.drug_mechanisms.target_components|[?uniprot=='Q16602']" \
--header 'Content-Type: application/json' \
--data '{
  "q": ["Q16602"],
  "scopes": ["chembl.drug_mechanisms.target_components.uniprot"]
}'
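For reference, the behavior wanted here (drop an entire chembl.drug_mechanisms object unless one of its target_components matches the starting ID) can be sketched as client-side post-processing. This assumes the nesting implied by the field path in the query; the sample document and the `label` field are made up, and `keep_mechanisms_for_target` is a hypothetical helper.

```python
def as_list(value):
    """Normalize None / single object / list into a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def keep_mechanisms_for_target(hit: dict, uniprot_id: str) -> dict:
    """Keep whole drug_mechanisms objects that have a target component with this UniProt ID."""
    chembl = hit.get("chembl", {})
    kept = [
        mech
        for mech in as_list(chembl.get("drug_mechanisms"))
        if any(
            tc.get("uniprot") == uniprot_id
            for tc in as_list(mech.get("target_components"))
        )
    ]
    return {**hit, "chembl": {**chembl, "drug_mechanisms": kept}}


# Made-up chemical document with two mechanisms on different targets.
hit = {
    "chembl": {
        "molecule_chembl_id": "CHEMBL_EXAMPLE",
        "drug_mechanisms": [
            {"label": "mech A", "target_components": [{"uniprot": "Q16602"}]},
            {"label": "mech B", "target_components": {"uniprot": "P00000"}},
        ],
    }
}

filtered = keep_mechanisms_for_target(hit, "Q16602")
```

Only "mech A" survives, together with everything stored on that object, which is the part a filter applied at the target_components level would not give.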

And MyVariant's civic-geneDisease-rev has the same problem (example GET query with 500 error), but including all the info for the variant (rather than the variant-disease pair) would probably be more of a problem:

  • jmespath would remove only the civic.evidence_items.disease section rather than the whole civic.evidence_items section
  • in this example, the two evidence items are on different diseases with different, specific pubmed refs. So including all the info for the variant (rather than the specific variant-disease pair) would be more confusing.
POST query where I get the error

# The URL is double-quoted so the single quotes around the JMESPath literal reach the server.
curl --location --globoff "https://myvariant.info/v1/query?size=1000&fields=civic.entrez_id,civic.evidence_items&jmespath=civic.evidence_items.disease|[?doid=='DOID:9256']" \
--header 'Content-Type: application/json' \
--data '{
    "q": "DOID:9256",
    "scopes": "civic.evidence_items.disease.doid"
}'

@colleenXu
Collaborator

Here's a list of the entity-based BioThings APIs (affected by the reverses issue):

  • MyChem
  • MyDisease
  • MyGene
  • MyVariant
  • BioThings AGR
  • BioThings DISEASES
  • BioThings EBIgene2phenotype
  • BioThings GO Biological Process
  • BioThings GO Cellular Component
  • BioThings GO Molecular Function
  • BioThings Foodb
  • BioThings HPO
  • BioThings iDISK
  • BioThings MGIgene2phenotype
  • BioThings RARe-SOURCE
  • BioThings Rhea
  • BioThings UBERON

@colleenXu
Collaborator

colleenXu commented Mar 19, 2024

These are the reverse operations where the forward direction has publication info that would be nice to retrieve. I organized by what seems doable now with jmespath (related to #733?)

  • looks doable with jmespath
    • MyGene: BPToGene, MFToGene, CCToGene
    • MyDisease gene-disease, variant-disease, phenotype-disease, phenotype-disease2, chemical-disease, chemical-disease2
  • not sure how doable, haven't analyzed yet
    • BioThings EBIgene2phenotype
    • BioThings MGIgene2phenotype
    • BioThings RARe-SOURCE
  • not doable right now
    • MyChem drugMechChemblEnsembl-rev, drugMechChemblUniprot-rev (see discussion above)
    • MyVariant civic-geneDisease-rev (see discussion above), civic-variantDisease-rev (haven't tried yet, but should have basically the same problem)
    • MyGene geneToDisease: see post below

@colleenXu
Collaborator

colleenXu commented Apr 2, 2024

I found a different jmespath issue with MyGene geneToDisease while working on #803

If I do this query, I get genes that match the disease, but I want to only keep the `clingen.clinical_validity` objects that have the matching disease.

Query:

curl --location 'https://mygene.info/v3/query?size=1000&fields=entrezgene%2Cclingen' \
--header 'Content-Type: application/json' \
--data '{
  "q": ["MONDO:0100283"],
  "scopes": "clingen.clinical_validity.mondo"
}'

Example hits:

  • first hit has 2 clingen.clinical_validity objects, where 1 matches the disease I queried.
  • VS the second hit has 1 object
    {
        "query": "MONDO:0100283",
        "_id": "10000",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": [
                {
                    "classification": "definitive",
                    "classification_date": "2021-07-29T21:34:39.431Z",
                    "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                    "gcep": "Brain Malformations",
                    "moi": "AD",
                    "mondo": "MONDO:0100283",
                    "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                    "sop": "SOP7"
                },
                {
                    "classification": "limited",
                    "classification_date": "2021-10-26T15:00:30.155Z",
                    "disease_label": "microcephaly",
                    "gcep": "Brain Malformations",
                    "moi": "AD",
                    "mondo": "MONDO:0001149",
                    "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_6e3b524c-5d27-43d6-a0db-4f8f7cf1f872-2021-10-26T150030.155Z",
                    "sop": "SOP8"
                }
            ]
        },
        "entrezgene": "10000"
    },
    {
        "query": "MONDO:0100283",
        "_id": "5296",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": {
                "classification": "definitive",
                "classification_date": "2021-07-29T21:36:16.452Z",
                "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0100283",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_fc9a451e-0e75-47d2-a090-a2409732c465-2021-07-29T213616.452Z",
                "sop": "SOP7"
            }
        },
        "entrezgene": "5296"
    },

But when I add jmespath, the hits that had one clinical_validity object (with the matching disease) become null.

Query:

# The URL is double-quoted so the single quotes around the JMESPath literal reach the server.
curl --location --globoff "https://mygene.info/v3/query?size=1000&fields=entrezgene,clingen&jmespath=clingen.clinical_validity|[?mondo=='MONDO:0100283']" \
--header 'Content-Type: application/json' \
--data '{
  "q": ["MONDO:0100283"],
  "scopes": "clingen.clinical_validity.mondo"
}'

Those same example hits:

  • first hit looks how I expect: there's now 1 clinical_validity object that matches the disease queried (vs 2 before)
  • VS the second hit now has null. But the clinical_validity object it had before matched the disease queried...
    {
        "query": "MONDO:0100283",
        "_id": "10000",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": [
                {
                    "classification": "definitive",
                    "classification_date": "2021-07-29T21:34:39.431Z",
                    "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                    "gcep": "Brain Malformations",
                    "moi": "AD",
                    "mondo": "MONDO:0100283",
                    "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                    "sop": "SOP7"
                }
            ]
        },
        "entrezgene": "10000"
    },
    {
        "query": "MONDO:0100283",
        "_id": "5296",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": null
        },
        "entrezgene": "5296"
    },

I suspect jmespath is having an issue with the array (multiple clinical_validity objects) vs object (1 clinical_validity object) in the original document...
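If that suspicion is right, one workaround is to normalize the field to a list before filtering. The sketch below does this client-side on trimmed copies of the two example hits above; it says nothing about how the server-side jmespath parameter behaves, and `filter_clinical_validity` is a hypothetical helper.

```python
def as_list(value):
    """Normalize None / single object / list into a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def filter_clinical_validity(hit: dict, mondo_id: str) -> dict:
    """Keep only the clinical_validity objects for one disease, for either input shape."""
    clingen = hit.get("clingen", {})
    records = as_list(clingen.get("clinical_validity"))
    matching = [r for r in records if r.get("mondo") == mondo_id]
    return {**hit, "clingen": {**clingen, "clinical_validity": matching}}


# Trimmed versions of the two example hits: one stores a list, the other a single object.
hit_list = {"_id": "10000", "clingen": {"clinical_validity": [
    {"classification": "definitive", "mondo": "MONDO:0100283"},
    {"classification": "limited", "mondo": "MONDO:0001149"},
]}}
hit_object = {"_id": "5296", "clingen": {"clinical_validity":
    {"classification": "definitive", "mondo": "MONDO:0100283"}}}

results = [filter_clinical_validity(h, "MONDO:0100283") for h in (hit_list, hit_object)]
```

Both hits end up with a one-element list holding the matching object, so the single-object hit no longer collapses to null.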

@colleenXu
Collaborator

colleenXu commented Apr 2, 2024

Made issues for the jmespath stuff I'm seeing:

@colleenXu
Collaborator

Potential breakthrough: using a new parameter jmespath_exclude_empty: true to remove hits that don't fit multiple criteria. Don't know if this is only live on MyChem or on all BioThings yet. See example #727 (comment)
