for entity-based record structures (BioThings APIs), "reverse" operations cannot retrieve the same information as "forward" operations #316

andrewsu · 2021-10-08T06:12:48Z

Tentatively labeling this a bug, but it may be an inherent limitation.

This query

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n2"
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "NCBIGene:2475"
                    ],
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "ids": [
                        "MONDO:0003406"
                    ],
                    "categories": [
                        "biolink:Disease",
                        "biolink:PhenotypicFeature"
                    ]
                }
            }
        }
    }
}

produces this result:

But when I simply flip the subject and object, the result has more edge provenance

Is there some inherent limitation in the smartAPI annotation on why this asymmetry has to exist?


colleenXu · 2021-10-09T05:44:25Z

@andrewsu This is two parts:

1. how BTE "picks" the direction to query in (when there's potentially a "better" choice in direction)
2. a limitation of how the data is structured in the API (from my conversations with Kevin). This is a known issue to me, and I've wondered if there are solutions from a biothings suite (querying) or BTE api-response-transform ("filtering" post-query) side...

An explanation of the second point:

For the core biothings APIs, the data is organized by entity so MyDisease.info is organized by Disease. When querying from Disease -> Gene, we can look up everything under that disease's disgenet.genes_related_to_disease section, which includes all of the information in the second screenshot.

However, when we want to query from Gene -> Disease, we need to match the gene ID, i.e. one specific record under the disgenet.genes_related_to_disease section. Because the data is structured by disease, the query retrieves everything under that section, not just the specific record that has that gene ID.

For example, POST this query starting with the Gene NDUFA1 (4694) to https://mydisease.info/v1/query?fields=disgenet.xrefs,disgenet.genes_related_to_disease:

{
    "q": "4694",
    "scopes": "disgenet.genes_related_to_disease.gene_id"
}

The response includes diseases where ONE of their objects matches the query, but it includes ALL of the genes related to those diseases rather than only the objects that have the matching gene...
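The kind of post-filtering this would call for can be sketched in Python. This is only an illustration, not BTE code: the function name `keep_matching_genes` and the sample hit (including its ID and scores) are invented, and only the field names `disgenet.genes_related_to_disease` and `gene_id` come from the query above.

```python
from copy import deepcopy


def keep_matching_genes(hit: dict, gene_id) -> dict:
    """Return a copy of one hit that keeps only the gene records
    whose gene_id equals the queried gene."""
    filtered = deepcopy(hit)
    section = filtered.get("disgenet", {})
    records = section.get("genes_related_to_disease", [])
    # A field may hold a single object instead of a list.
    if isinstance(records, dict):
        records = [records]
    # Compare as strings so 4694 and "4694" are treated the same.
    section["genes_related_to_disease"] = [
        r for r in records if str(r.get("gene_id")) == str(gene_id)
    ]
    return filtered


# Hypothetical hit shaped like a MyDisease response (values invented).
sample_hit = {
    "_id": "example-disease",
    "disgenet": {
        "genes_related_to_disease": [
            {"gene_id": 4694, "score": 0.3},
            {"gene_id": 7157, "score": 0.1},
        ]
    },
}

result = keep_matching_genes(sample_hit, "4694")
print(result["disgenet"]["genes_related_to_disease"])
```

The original hit is left untouched, so the unfiltered document is still available if another operation needs it.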

I hit a similar problem when trying to make more specific queries to map to more specific biolink predicates (like marker/mechanism under MyDisease's CTD Disease-Chemical information). I describe another example in the notes here. Because I get all the objects under the disease back rather than the matching objects only, I cannot make more specific queries...

colleenXu · 2021-12-03T20:32:19Z

Returning to this: this is an inherent limitation from how these records are structured (and indexed and retrieved - the querying process). ~~Going to propose closing this unless we plan to address it...~~

After discussion with Andrew 12/6, we decided to keep this open as a non-critical thing....to discuss + maybe work on when there is time...

tokebe · 2022-05-04T17:52:41Z

If this issue can be addressed through api_response_transform or elsewhere in records handling, we might now be in a better position to address this?

colleenXu · 2022-05-06T23:37:15Z

@tokebe The last time I talked about it with Andrew, it seemed kinda hard...

I think this is a limitation imposed by the document-structure / biothings querying ability itself.

- MyDisease is organized by disease. So querying with a disease will pull up an entire document as a hit (like its relationships to genes), and we want everything in that document.
- However, if we query using a gene (example: https://mydisease.info/v1/query?q=disgenet.genes_related_to_disease.gene_id:7157&fields=disgenet), we pull up the entire disease's document (its relationships to all other genes). There isn't a clear way to get just the part of the document that refers to that specific gene.
- In Slack, I was told that there isn't a good way to "get just this part of the document" using the query language alone; post-processing is needed.
- Perhaps post-processing with an api-response-transform module can pull out the parts of documents but...
  - there'd need to be some awareness of what operations are "reversed" and need this post-processing
  - the "part of the document" we want to pull out might differ between operations / apis....since it depends on how the document is structured
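One way to picture those two requirements is a small per-operation registry. The Python sketch below is an assumption-laden illustration, not how api-response-transform works: the operation names, the `REVERSE_FILTERS` table and `filter_reverse_hit` are all hypothetical. Only the dotted field paths are taken from this thread.

```python
# Hypothetical registry: which reversed operations need post-filtering,
# where their sub-records live, and which key holds the input ID.
REVERSE_FILTERS = {
    "mydisease-disgenet-by-gene": ("disgenet.genes_related_to_disease", "gene_id"),
    "mygene-clingen-by-disease": ("clingen.clinical_validity", "mondo"),
}


def filter_reverse_hit(operation: str, hit: dict, input_id) -> dict:
    """Filter `hit` in place so the configured section keeps only the
    sub-records that mention `input_id`. Unknown operations pass through."""
    if operation not in REVERSE_FILTERS:
        return hit
    path, key = REVERSE_FILTERS[operation]
    *parents, leaf = path.split(".")
    node = hit
    for part in parents:
        node = node.get(part)
        if not isinstance(node, dict):
            return hit  # section missing: nothing to filter
    records = node.get(leaf)
    if records is None:
        return hit
    if isinstance(records, dict):  # single object instead of a list
        records = [records]
    node[leaf] = [r for r in records if str(r.get(key)) == str(input_id)]
    return hit


hit = {
    "_id": "10000",
    "clingen": {
        "clinical_validity": [
            {"mondo": "MONDO:0100283", "classification": "definitive"},
            {"mondo": "MONDO:0001149", "classification": "limited"},
        ]
    },
}
filtered = filter_reverse_hit("mygene-clingen-by-disease", hit, "MONDO:0100283")
print(filtered["clingen"]["clinical_validity"])
```

Keeping the path and key in data rather than code means each API's document layout is described once, though a real version would still need to decide where that table lives.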

colleenXu · 2022-05-06T23:39:55Z

This isn't an issue for "association-based" APIs, AKA where the structure is "one document per association" and all the info on the association is kept in a separate part of the document from the entity IDs.

As soon as a document has parts (like multiple associations in 1 document, each document represents 1 of the entity IDs)....this problem happens.
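To make the contrast concrete, here is a hedged Python sketch that reshapes one entity-based document into association-style documents. The output keys (`subject`, `object`, `info`) and the sample values are invented for illustration; they are not the layout of any real BioThings API.

```python
def to_association_docs(entity_doc: dict) -> list:
    """Split one disease-centric document into one document per
    disease-gene association."""
    disease_id = entity_doc["_id"]
    records = entity_doc.get("disgenet", {}).get("genes_related_to_disease", [])
    if isinstance(records, dict):  # single object instead of a list
        records = [records]
    return [
        {
            "subject": disease_id,
            "object": r["gene_id"],
            # Everything else describes this one association only.
            "info": {k: v for k, v in r.items() if k != "gene_id"},
        }
        for r in records
    ]


# Hypothetical entity-based document (values invented).
entity_doc = {
    "_id": "example-disease",
    "disgenet": {
        "genes_related_to_disease": [
            {"gene_id": 7157, "score": 0.4},
            {"gene_id": 4694, "score": 0.2},
        ]
    },
}

docs = to_association_docs(entity_doc)
# Looking up by gene now returns only that association's info.
by_gene = [d for d in docs if d["object"] == 7157]
print(by_gene)
```

With one document per association, a lookup from either end returns the same `info`, which is why the forward/reverse asymmetry does not arise there.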

tokebe · 2022-05-09T15:50:35Z

Hmm...this seems like it should be possible with post-processing in the transformer, but I agree that this would have to be basically on a per-API basis. We'd have to write new transformers for this, so it makes sense this should remain non-critical until we have more bandwidth.

ericz1803 · 2022-08-23T19:05:21Z

@tokebe Can you explain how it could be done with post-processing in the transformer? I was looking into this issue a bit and it seems like when querying from Gene->Disease, the disgenet score/information is missing so there would need to be other queries done to retrieve this information again.

Also, would it be possible/practical to have something that says that mydisease should always be queried starting from Disease?

tokebe · 2022-08-23T20:37:15Z

I was under the impression that the issue is that querying Gene->Disease returns the whole document, and we currently don't have the logic to pull the disgenet score/information out of it? This would be in the untransformedHits prior to going through the transformer.

If this isn't the case, then yes, we'd have to come up with some other method of retrieving the additional information. I'm not sure exactly how practical it might be to specifically query mydisease Disease-first always, though it might be relatively doable with a custom query builder. This would still require a custom transformer, however, and some additional logic to ensure records are created in the correct direction.

The preference would definitely be to post-process untransformedHits in a new transformer over custom querying logic, if possible.

ericz1803 · 2022-08-23T23:09:57Z

I did some more investigation and it doesn't grab the disgenet score/information at all when querying from Gene->Disease (the params pulled from the x-bte are completely different). I think what @colleenXu is saying above is that this is a limitation of how the data is structured. So if we were to take the post-processing route, we would have to make a whole other query to retrieve the disgenet score/info document, process that, then reincorporate it into the results.

Below are the query configs and the resulting unTransformedHits:

tokebe · 2022-08-24T17:08:34Z

I suppose this makes the query-direction route more viable -- we'd need a separate query builder for mydisease that checks the subject/object semantic type and queries in reverse appropriately. It would have to somehow tag this such that the record is constructed in reverse of the query where appropriate as well.

Perhaps a reverseAfterQuery value that can be attached to the query_info object, which is passed to the transformer to instruct it to construct the record in reverse (anywhere else, such as reversing post-record-built, would cause issues with directionality down the line)
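A rough Python sketch of that idea follows. Everything here is hypothetical: `QueryInfo`, `Record`, `build_record` and the `reverse_after_query` field only mirror the proposal in spirit and are not BTE's actual classes, and the attribute value is invented. The two IDs are the ones from the query at the top of this issue.

```python
from dataclasses import dataclass, field


@dataclass
class QueryInfo:
    input_id: str                      # ID the API was actually queried with
    reverse_after_query: bool = False  # set by the custom query builder


@dataclass
class Record:
    subject_id: str
    object_id: str
    attributes: dict = field(default_factory=dict)


def build_record(info: QueryInfo, output_id: str, attributes: dict) -> Record:
    """Build the record at transform time, flipping direction if the
    API was queried in reverse of the original edge."""
    if info.reverse_after_query:
        return Record(subject_id=output_id, object_id=info.input_id,
                      attributes=attributes)
    return Record(subject_id=info.input_id, object_id=output_id,
                  attributes=attributes)


# Original edge is Gene -> Disease, but the API is queried Disease-first.
info = QueryInfo(input_id="MONDO:0003406", reverse_after_query=True)
record = build_record(info, "NCBIGene:2475", {"source": "example"})
print(record.subject_id, "->", record.object_id)
```

Flipping inside the record constructor, rather than after the record exists, matches the concern above about directionality problems further down the line.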

colleenXu · 2022-08-26T07:02:09Z

Err....I was out when this convo started but perhaps some more explanation / my perspective can help.

I am in agreement with Jackson's points here, and that working with the unTransformedHits is better.

But to develop that code, one will have to mutate the smartapi specs or work with a custom version of the smartapi yaml where the fields are specified differently (to retrieve all the info available in forward querying, during a reverse query).

Notice the query I give in my original post. This query doesn't have the same "fields" specified as the query in the x-bte annotation right now, because we don't have the features to correctly process it (it would just be extra data to send over the internet / ignore while processing).

colleenXu · 2022-12-15T20:28:51Z

Noting a related old discussion (internal lab Slack link): besides the "reverse" issue here, there's an issue of not being able to get a subset of the response. This is a problem when we want to treat those subsets differently (ex: assigning different biolink predicates or edge-attributes for the TRAPI response).

Also pasted below:

colleenxu
Oct 14 2021
is there a way to query and only get the part of the document that matches back? For example, I can query https://mydisease.info/v1/query?q=disgenet.variants_related_to_disease.source:CLINVAR&fields=disgenet.variants_related_to_disease,disgenet.xrefs but it'll return ALL the variants for the disease when maybe only 1-2 of those variants actually match source:CLINVAR...

Jerry
Oct 15 2021
colleenxu
not supported through biothings but if you can directly query es, you can use https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html

colleenxu
Oct 15 2021
I don't think I'll do direct es...but I don't understand how highlighting would do what I specified above. This sounds a bit closer to what I wanted to do...https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html#post-filter ....

Jerry
Oct 15 2021
Highlighting shows where the matches are, you do have to further process the result to use it. Post filter applies to the case when you use both search and aggregation, it's not what you're looking for. :) (edited)

colleenxu
Oct 15 2021
I see now, further processing would be needed to "filter out" stuff that didn't have the highlighting...

colleenXu · 2023-09-13T01:08:51Z

Not clear if post-processing improvements (JQ-related, using biothings apis query abilities) will help overcome the issues here. Linking to #656, #489 and #521

rjawesome · 2023-09-29T00:14:44Z

JQ could be able to help.
For this example, the following wrap filter could be used: .disgenet.variants_related_to_disease |= list_filter_any(["source:CLINVAR"])
(since there is only one filter being used list_filter_all and list_filter_any would do the same thing)
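
For readers unfamiliar with this style of filter, the intended effect can be approximated locally in Python. This is an assumption about the intent of the `key:value` conditions, not the implementation of `list_filter_any` or `list_filter_all`; the function names and sample variants below are invented.

```python
def parse_condition(condition: str) -> tuple:
    """Split "key:value" on the first colon only, so values that
    themselves contain colons (e.g. CURIEs) stay intact."""
    key, _, value = condition.partition(":")
    return key, value


def keep_any(records: list, conditions: list) -> list:
    """Keep records that satisfy at least one condition."""
    parsed = [parse_condition(c) for c in conditions]
    return [r for r in records
            if any(str(r.get(k)) == v for k, v in parsed)]


def keep_all(records: list, conditions: list) -> list:
    """Keep records that satisfy every condition."""
    parsed = [parse_condition(c) for c in conditions]
    return [r for r in records
            if all(str(r.get(k)) == v for k, v in parsed)]


# Hypothetical variant records (values invented).
variants = [
    {"rsid": "rs-example-1", "source": "CLINVAR"},
    {"rsid": "rs-example-2", "source": "OTHER"},
]

only_clinvar = keep_any(variants, ["source:CLINVAR"])
print(only_clinvar)
```

With a single condition the two helpers return the same list, which is the point made in the parenthetical above.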

Oct 14 2021
is there a way to query and only get the part of the document that matches back? For example, I can query https://mydisease.info/v1/query?q=disgenet.variants_related_to_disease.source:CLINVAR&fields=disgenet.variants_related_to_disease,disgenet.xrefs but it'll return ALL the variants for the disease when maybe only 1-2 of those variants actually match source:CLINVAR...

colleenXu · 2023-10-24T07:50:25Z

Noting that we previously decided list_filter could not be used to address the "reverses" issue: see the internal Slack discussion starting here. It's hard to paste the whole convo here, but I may do it later...

colleenXu · 2024-03-18T21:00:13Z

Update

We still have issues with not being able to retrieve all the information on the association in "reverse" direction.

I was able to get MyChem aeolus count info to show in the reverse direction, by doing a non-batch query and using jmespath (only show the part of the json object that matches the starting ID).

I was also able to get MyChem Chembl treats reference info to show in the reverse direction (see commit).

But I wasn't able to use the same method to get the MyChem chembl drug-mechanism clinicaltrial info to show (drugMechChemblEnsembl-rev, drugMechChemblUniprot-rev):

When I try to use jmespath on chembl.drug_mechanisms.target_components.uniprot, I get a 500 error (GET query version).
- But that query probably wouldn't do what I want anyways: it may only remove the chembl.drug_mechanisms.target_components section, when I want the entire chembl.drug_mechanisms section removed if chembl.drug_mechanisms.target_components.uniprot doesn't match the starting ID
If I just add the field to the response + response-mapping, BTE will include all the drug-mech reference info for that chemical, rather than just that specific chem-gene association
- However, I haven't found an example where this is a problem. It'd be a problem when:
  - there's > 1 drug-mech object for a chemical
  - the refs for those two objects are different and specific to that chemical-gene-relationship
- For example, this has a pubmed ref for the Kappa opioid receptor but not the Mu opioid receptor. But that pubmed's abstract actually refers to the mu opioid receptor...so it'd probably be fine to include it.
UPDATE 2024-03-26: discussed with Andrew. He says that he doesn't want to depend on this assumption that it's fine to mix together all the drug-mech reference info for a chem, when some may not apply to that specific chem-gene pair. So we'll treat this as a "not doable right now"
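
The behavior wanted here (drop a whole chembl.drug_mechanisms entry when none of its target_components match) can be sketched client-side in Python. This is a hedged illustration only: the helper names are made up, the `refs` key and the second UniProt value are placeholders, and only the dotted field path and `Q16602` come from this comment.

```python
def as_list(value) -> list:
    """Treat a missing value as empty and a single object as a 1-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def keep_mechanisms_for_target(hit: dict, uniprot_id: str) -> list:
    """Return only the drug_mechanisms entries that have at least one
    target component with the given UniProt ID. Each kept entry is
    returned whole, including its own references."""
    mechanisms = as_list(hit.get("chembl", {}).get("drug_mechanisms"))
    return [
        m for m in mechanisms
        if any(tc.get("uniprot") == uniprot_id
               for tc in as_list(m.get("target_components")))
    ]


# Hypothetical chemical document (structure simplified, values invented).
chem_hit = {
    "chembl": {
        "drug_mechanisms": [
            {"target_components": [{"uniprot": "Q16602"}], "refs": ["ref-A"]},
            {"target_components": {"uniprot": "PLACEHOLDER"}, "refs": ["ref-B"]},
        ]
    }
}

kept = keep_mechanisms_for_target(chem_hit, "Q16602")
print(kept)
```

Filtering at the parent level keeps each mechanism's references tied to the chemical-gene pair they belong to, instead of pooling every reference for the chemical.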

POST query version

curl --location --globoff "https://mychem.info/v1/query?size=1000&fields=chembl.molecule_chembl_id,chembl.drug_mechanisms&jmespath=chembl.drug_mechanisms.target_components|[?uniprot=='Q16602']" \
--header 'Content-Type: application/json' \
--data '{
  "q": ["Q16602"],
  "scopes": ["chembl.drug_mechanisms.target_components.uniprot"]
}'

And My Variant's civic-geneDisease-rev has the same problem (example GET query with 500 error) but including all the info for the variant (rather than the variant-disease pair) would probably be more of a problem:

would remove only the civic.evidence_items.disease section rather than the whole civic.evidence_items section
in this example, the two evidence items are on different diseases with different, specific pubmed refs. So including all the info for the variant (rather than the specific variant-disease pair) would be more confusing.

POST query where I get the error

curl --location --globoff "https://myvariant.info/v1/query?size=1000&fields=civic.entrez_id,civic.evidence_items&jmespath=civic.evidence_items.disease|[?doid=='DOID:9256']" \
--header 'Content-Type: application/json' \
--data '{
    "q": "DOID:9256",
    "scopes": "civic.evidence_items.disease.doid"
}'

colleenXu · 2024-03-19T18:02:51Z

Here's a list of the entity-based BioThings APIs (affected by the reverses issue):

MyChem
MyDisease
MyGene
MyVariant
BioThings AGR
BioThings DISEASES
BioThings EBIgene2phenotype
BioThings GO Biological Process
BioThings GO Cellular Component
BioThings GO Molecular Function
BioThings Foodb
BioThings HPO
BioThings iDISK
BioThings MGIgene2phenotype
BioThings RARe-SOURCE
BioThings Rhea
BioThings UBERON

colleenXu · 2024-03-19T18:04:19Z

These are the reverse operations where the forward direction has publication info that would be nice to retrieve. I organized by what seems doable now with jmespath (related to #733?)

looks doable with jmespath
- MyGene: BPToGene, MFToGene, CCToGene
- MyDisease gene-disease, variant-disease, phenotype-disease, phenotype-disease2, chemical-disease, chemical-disease2
not sure how doable, haven't analyzed yet
- BioThings EBIgene2phenotype
- BioThings MGIgene2phenotype
- BioThings RARe-SOURCE
not doable right now
- MyChem drugMechChemblEnsembl-rev, drugMechChemblUniprot-rev (see discussion above)
- MyVariant civic-geneDisease-rev (see discussion above), civic-variantDisease-rev (haven't tried yet, but should have basically the same problem)
- MyGene geneToDisease: see post below

colleenXu · 2024-04-02T00:14:12Z

I found a different jmespath issue with MyGene geneToDisease while working on #803

If I do this query, I get genes that match the disease, but I want to only keep the `clingen.clinical_validity` objects that have the matching disease.

Query:

curl --location 'https://mygene.info/v3/query?size=1000&fields=entrezgene%2Cclingen' \
--header 'Content-Type: application/json' \
--data '{
  "q": ["MONDO:0100283"],
  "scopes": "clingen.clinical_validity.mondo"
}'

Example hits:

first hit has 2 clingen.clinical_validity objects, where 1 matches the disease I queried.
VS the second hit has 1 object

    {
        "query": "MONDO:0100283",
        "_id": "10000",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": [
                {
                    "classification": "definitive",
                    "classification_date": "2021-07-29T21:34:39.431Z",
                    "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                    "gcep": "Brain Malformations",
                    "moi": "AD",
                    "mondo": "MONDO:0100283",
                    "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                    "sop": "SOP7"
                },
                {
                    "classification": "limited",
                    "classification_date": "2021-10-26T15:00:30.155Z",
                    "disease_label": "microcephaly",
                    "gcep": "Brain Malformations",
                    "moi": "AD",
                    "mondo": "MONDO:0001149",
                    "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_6e3b524c-5d27-43d6-a0db-4f8f7cf1f872-2021-10-26T150030.155Z",
                    "sop": "SOP8"
                }
            ]
        },
        "entrezgene": "10000"
    },
    {
        "query": "MONDO:0100283",
        "_id": "5296",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": {
                "classification": "definitive",
                "classification_date": "2021-07-29T21:36:16.452Z",
                "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0100283",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_fc9a451e-0e75-47d2-a090-a2409732c465-2021-07-29T213616.452Z",
                "sop": "SOP7"
            }
        },
        "entrezgene": "5296"
    },

But when I add jmespath, the hits that had one clinical_validity object (with the matching disease) become null.

Query:

curl --location --globoff "https://mygene.info/v3/query?size=1000&fields=entrezgene,clingen&jmespath=clingen.clinical_validity|[?mondo=='MONDO:0100283']" \
--header 'Content-Type: application/json' \
--data '{
  "q": ["MONDO:0100283"],
  "scopes": "clingen.clinical_validity.mondo"
}'

Those same example hits:

first hit looks how I expect: there's now 1 clinical_validity object that matches the disease queried (vs 2 before)
VS the second hit now has null. But the clinical_validity object it had before matched the disease queried...

    {
        "query": "MONDO:0100283",
        "_id": "10000",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": [
                {
                    "classification": "definitive",
                    "classification_date": "2021-07-29T21:34:39.431Z",
                    "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                    "gcep": "Brain Malformations",
                    "moi": "AD",
                    "mondo": "MONDO:0100283",
                    "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                    "sop": "SOP7"
                }
            ]
        },
        "entrezgene": "10000"
    },
    {
        "query": "MONDO:0100283",
        "_id": "5296",
        "_score": 10.205105,
        "clingen": {
            "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
            "clinical_validity": null
        },
        "entrezgene": "5296"
    },

I suspect jmespath is having an issue with the array (multiple clinical_validity objects) vs object (1 clinical_validity object) in the original document...
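
That suspicion is consistent with how JMESPath defines filter expressions: `[?...]` applied to a value that is not an array evaluates to null. The Python sketch below imitates that rule without using any JMESPath library, then shows a normalized variant. Both function names are invented, and the data is trimmed from the two hits above.

```python
def filter_like_jmespath(value, mondo_id: str):
    """Imitate `[?mondo=='...']`: a non-list input yields None."""
    if not isinstance(value, list):
        return None
    return [v for v in value if v.get("mondo") == mondo_id]


def filter_normalized(value, mondo_id: str) -> list:
    """Wrap a single object in a list first, then filter."""
    items = value if isinstance(value, list) else [value]
    return [v for v in items
            if isinstance(v, dict) and v.get("mondo") == mondo_id]


# Gene 10000: clinical_validity is a list of two objects.
gene_10000 = [
    {"classification": "definitive", "mondo": "MONDO:0100283"},
    {"classification": "limited", "mondo": "MONDO:0001149"},
]
# Gene 5296: clinical_validity is a single object, not a list.
gene_5296 = {"classification": "definitive", "mondo": "MONDO:0100283"}

print(filter_like_jmespath(gene_10000, "MONDO:0100283"))
print(filter_like_jmespath(gene_5296, "MONDO:0100283"))
print(filter_normalized(gene_5296, "MONDO:0100283"))
```

The second call returns None even though the object matches, mirroring the null seen for gene 5296; the normalized variant keeps it.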

colleenXu · 2024-04-02T01:09:32Z

Made issues for the jmespath stuff I'm seeing:

- Problem with jmespath: unexpected behavior with single items (not arrays) biothings.api#323: affects current use, because it's removing expected answers and I can't make a special reverse for MyGene geneToDisease in TRAPI 1.5: support source_record_urls #803
- jmespath: 500 error biothings.api#324: error encountered in specific cases
- jmespath: removing higher-level objects based on lower-level matches biothings.api#325: potential extension/solution to problems?

colleenXu · 2024-05-22T06:30:50Z

Potential breakthrough: using a new parameter jmespath_exclude_empty: true to remove hits that don't fit multiple criteria. Don't know if this is only live on MyChem or on all BioThings yet. See example #727 (comment)

andrewsu added the bug Something isn't working label Oct 8, 2021

This was referenced Dec 21, 2022

more specific operations for MyChem chembl.drug_mechanisms data biothings/pending.api#100

Closed

New API based on MyChem drugcentral.bioactivity data biothings/pending.api#101

Closed

colleenXu changed the title ~~edge provenance for mydisease.info differs depending on direction~~ for entity-based record structures (BioThings APIs), "reverse" operations cannot retrieve the same information as "forward" operations Jan 31, 2023

andrewsu added enhancement New feature or request and removed bug Something isn't working labels May 3, 2023

colleenXu mentioned this issue May 12, 2023

Data source: RARe-SOURCE biothings/pending.api#109

Closed

colleenXu mentioned this issue Jul 21, 2023

x-bte annotation refactoring discussion #656

Open

colleenXu mentioned this issue Oct 4, 2023

x-bte operations: replace BioThings list_filter with jmespath #733

Closed

colleenXu added the jq / jmespath label Oct 18, 2023

colleenXu added the needs discussion label Oct 25, 2023

colleenXu added the x-bte label Oct 25, 2023

colleenXu mentioned this issue Oct 25, 2023

summary: x-bte-refactoring related issues #750

Open

colleenXu mentioned this issue Dec 22, 2023

Data source: repoDB biothings/pending.api#77

Closed

colleenXu mentioned this issue Mar 26, 2024

TRAPI 1.5: support source_record_urls #803

Closed

colleenXu mentioned this issue Apr 2, 2024

Problem with jmespath: unexpected behavior with single items (not arrays) biothings/biothings.api#323

Closed

colleenXu mentioned this issue Jul 30, 2024

Jmespath cleanup and followup #841

Open

colleenXu mentioned this issue Aug 28, 2024

not urgent: ideal KL/AT adjustments #858

Open

colleenXu mentioned this issue Dec 12, 2024

MyChem drugcentral bioactivity: write reverse operations #905

Open
