investigate why BTE doesn't retrieve variant-disease relations from clinvar #548

Closed
andrewsu opened this issue Jan 13, 2023 · 11 comments
@andrewsu
Member

Clinvar contains relationships between genetic variants and diseases (e.g., BRAF V600E -> melanoma), and that relationship appears to be captured in myvariant.info (e.g., http://myvariant.info/v1/variant/rs121913377). But I can't get this relationship via BTE when querying using any of these identifiers:

Note the DBSNP query gets results based on CIViC and Disgenet, but not clinvar. It appears that the clinvar fields are captured in the myvariant.info smartAPI annotation, but I can't quite figure out why those results aren't being captured.

TRAPI Query template
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["CLINVAR:362948"],
                    "categories": ["biolink:SequenceVariant"]
                },
                "n1": {
                    "categories": ["biolink:DiseaseOrPhenotypicFeature"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "n1",
                    "subject": "n0"
                }
            }
        }
    }
}
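Since the same template gets tried with several identifiers, a small helper can generate it. This is only a sketch: `build_trapi_query` is a hypothetical name, not part of BTE, and it just reproduces the JSON above for whichever CURIE is passed in.

```python
import json


def build_trapi_query(variant_curie, object_category="biolink:DiseaseOrPhenotypicFeature"):
    """Return a one-hop TRAPI query from a sequence variant to the given category."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {
                        "ids": [variant_curie],
                        "categories": ["biolink:SequenceVariant"],
                    },
                    "n1": {"categories": [object_category]},
                },
                "edges": {"t_edge": {"subject": "n0", "object": "n1"}},
            }
        }
    }


def to_json(query):
    """Serialize the query the same way it is shown in this issue."""
    return json.dumps(query, indent=4)
```

Calling `to_json(build_trapi_query("DBSNP:rs121913377"))` would give the same query with the dbSNP identifier swapped in.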
@rjawesome
Contributor

rjawesome commented Jan 17, 2023

Looks like the SmartAPI annotation for myvariant.info clinvar uses the OMIM ID to identify the disease. If you look at the myvariant.info link you sent, the diseases lack an OMIM ID but do have other ID types (e.g., mondo) that the SmartAPI annotation doesn't use.
Looking at the melanoma result from the myvariant query you sent above:

{
   "accession":"RCV000442563",
   "clinical_significance":"Likely pathogenic",
   "conditions":{
      "identifiers":{
         "human_phenotype_ontology":"HP:0007474",
         "medgen":"C0025202",
         "mesh":"D008545",
         "mondo":"MONDO:0005105"
      },
      "name":"Melanoma"
   },
   ...
}

Meanwhile, other clinvar relations do seem to work in BTE (e.g., the example DBSNP:rs1193171808 -> OMIM:615592 given in the SmartAPI annotation).
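To make the mismatch concrete, here is a rough sketch that checks which identifier fields of a clinvar condition an annotation could actually use. The function name and the `annotated_fields` sets are assumptions for illustration; the record is the melanoma condition shown above.

```python
# Melanoma condition copied from the MyVariant record above.
melanoma_condition = {
    "identifiers": {
        "human_phenotype_ontology": "HP:0007474",
        "medgen": "C0025202",
        "mesh": "D008545",
        "mondo": "MONDO:0005105",
    },
    "name": "Melanoma",
}


def usable_identifier_fields(condition, annotated_fields):
    """Return the identifier fields present on the condition that the annotation maps."""
    present = set(condition.get("identifiers", {}))
    return sorted(present & set(annotated_fields))


# Assumption: the annotation only maps the omim field, as described in this comment.
omim_only = usable_identifier_fields(melanoma_condition, {"omim"})

# Hypothetical wider annotation that also maps HPO, MeSH and MONDO.
wider = usable_identifier_fields(
    melanoma_condition, {"omim", "human_phenotype_ontology", "mesh", "mondo"}
)
```

With the OMIM-only set the result is empty, which matches what BTE returned for this record; the wider set picks up three fields.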

@andrewsu
Member Author

Thanks @rjawesome for this careful diagnosis. Makes sense! So in addition to the OMIM mapping in the SmartAPI annotation in the x-bte-response-mapping section, can we also add additional mappings for HPO, MESH, and MONDO so the original BRAF V600E -> melanoma example would also be retrieved by BTE?

@rjawesome
Contributor

It seems mesh and mondo are not indexed by myvariant, so I don't know whether those fields are queryable. For now, I have made a pull request to add HPO.

@colleenXu
Collaborator

This can be partially addressed by adding more x-bte annotation (+ indexing fields if needed). However, this kind of "multiple prefixes/namespaces" issue is related to #656

@colleenXu
Collaborator

colleenXu commented Dec 6, 2023

Notes on the current situation

Added and deployed orphanet / hp operations. All operations passed manual testing, including clinvar-gene-phenoHP-rev and clinvar-variant-phenoHP-rev (affected by #756 (comment), which I described in the 2nd section of that comment).

However, this didn't address the original issue, because MyVariant's clinvar rcv entries for DBSNP:rs121913377 seem to use an HPO ID for melanoma that is wrong or outdated: HP:0007474. Those entries don't have omim / orphanet fields either. (example in Rohan's previous comment).

more examples of strange IDs (HPO, Orphanet, MedGen)

I noticed multiple kinds of clinvar rcv disease IDs that seemed to be wrong / outdated, but I haven't checked clinvar or OLS to see if the IDs are also wrong there (vs something going on in MyVariant parsing?).

HPO:

Orphanet:

MedGen:

Another issue is that this set of operations (omim, orphanet, hp) only covers 48% of the dataset (1038239 / 2162597)

Possible next steps

  • Use mondo/mesh namespaces: but we'd need to wait until the fields are indexed (issue)
    • don't know how much data they cover because _exists_ queries don't work...so it's unclear how much they'll improve the situation
    • I can write operations but the reverse ones won't work (starting with mondo/mesh field values)
  • Use medgen namespace: covers almost all records (only 2 records not covered). But...
    • Translator doesn't support the namespace
    • something is going on with the IDs (see collapsed section above)

@colleenXu
Collaborator

colleenXu commented Jan 10, 2024

The mondo/mesh namespaces have now been indexed biothings/myvariant.info#175 (comment) and I added x-bte operations to cover them NCATS-Tangerine/translator-api-registry@d4228a7

Now:

  • the clinvar x-bte annotation covers 67% of the dataset (1452368 / 2162597) and 5 namespaces (omim, orphanet, hp, mondo, mesh).
  • the original issue is addressed: BTE will find the edge between BRAF V600E (DBSNP:rs121913377) and melanoma from clinvar, using the mondo and mesh x-bte operations
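For reference, the coverage figures quoted in this thread can be reproduced with a few lines of arithmetic. The record counts come from the comments above; the function name is just illustrative.

```python
TOTAL_RECORDS = 2162597          # clinvar records in the dataset, as quoted in this thread
COVERED_3_NAMESPACES = 1038239   # omim, orphanet, hp
COVERED_5_NAMESPACES = 1452368   # omim, orphanet, hp, mondo, mesh


def coverage_percent(covered, total=TOTAL_RECORDS):
    """Coverage as a whole-number percentage of the dataset."""
    return round(100 * covered / total)


before = coverage_percent(COVERED_3_NAMESPACES)
after = coverage_percent(COVERED_5_NAMESPACES)

# Records newly covered by adding the mondo and mesh operations.
gained = COVERED_5_NAMESPACES - COVERED_3_NAMESPACES
```

So adding mondo/mesh moved coverage from 48% to 67%, i.e. 414129 additional records.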
original query and current response

Send a POST request to the API-specific endpoint (MyVariant only), e.g. http://localhost:3000/v1/smartapi/09c8782d9f4027712e65b95424adba79/query.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["DBSNP:rs121913377"],
                    "categories":["biolink:SequenceVariant"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
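A minimal sketch of sending that query with the Python standard library, assuming a local BTE instance is listening on port 3000 as in the URL above. `build_request` and `send` are hypothetical helper names; `send` needs the server to be running, so it isn't called here.

```python
import json
import urllib.request

# API-specific endpoint from this comment (MyVariant only, local BTE instance).
URL = "http://localhost:3000/v1/smartapi/09c8782d9f4027712e65b95424adba79/query"

QUERY = {
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["DBSNP:rs121913377"],
                    "categories": ["biolink:SequenceVariant"],
                },
                "n1": {"categories": ["biolink:Disease"]},
            },
            "edges": {"e1": {"subject": "n0", "object": "n1"}},
        }
    }
}


def build_request(query, url=URL):
    """Build (but do not send) a JSON POST request carrying the TRAPI query."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and parse the JSON response; requires a running server."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be `send(build_request(QUERY))` against a running instance.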

Response will have this edge from clinvar connecting BRAF V600E to melanoma (MONDO:0005105).

                "2db8263141031ac84d9fea9c457ebba6": {
                    "predicate": "biolink:related_to",
                    "subject": "DBSNP:rs121913377",
                    "object": "MONDO:0005105",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:clinvar",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:myvariant-info",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:clinvar"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:myvariant-info"
                            ]
                        }
                    ]
                },
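To confirm that an edge like this really comes from clinvar rather than CIViC or DisGeNET, one can check its `sources` list. A rough sketch, assuming the edge has the shape shown above (the edge here is abbreviated to the fields being checked, and the function name is hypothetical):

```python
# Edge copied from the response above, keeping only the fields used below.
edge = {
    "predicate": "biolink:related_to",
    "subject": "DBSNP:rs121913377",
    "object": "MONDO:0005105",
    "sources": [
        {"resource_id": "infores:clinvar", "resource_role": "primary_knowledge_source"},
        {
            "resource_id": "infores:myvariant-info",
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:clinvar"],
        },
        {
            "resource_id": "infores:service-provider-trapi",
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:myvariant-info"],
        },
    ],
}


def primary_knowledge_sources(edge):
    """Return the resource_ids whose role is primary_knowledge_source."""
    return [
        source["resource_id"]
        for source in edge.get("sources", [])
        if source.get("resource_role") == "primary_knowledge_source"
    ]


is_from_clinvar = "infores:clinvar" in primary_knowledge_sources(edge)
```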


Last thing to do before closing this issue is to investigate the odd IDs (from the previous post)

@colleenXu
Collaborator

colleenXu commented Jan 10, 2024

On MyVariant clinvar data's melanoma identifiers:

I think the identifier set is the same between records (variants)

{
  "identifiers": {
    "human_phenotype_ontology": "HP:0007474",
    "medgen": "C0025202",
    "mesh": "D008545",
    "mondo": "MONDO:0005105"
  },
  "name": "Melanoma"
}

I was concerned with the HP ID HP:0007474 (which SRI Node Norm didn't recognize) and the MedGen ID C0025202, which seems to be a UMLS ID. While I'd need to find the exact line(s) in the clinvar data file to confirm what is happening, I think these "odd" IDs come from the Clinvar data file and aren't necessarily "wrong".

What I found

  • I didn't find the melanoma IDs in the clinvar page for BRAF V600E. However, after clicking on the "Conditions" tab (near "Variant details" and "Gene(s)"), I got to the "Variation/condition record" page RCV000442563.1, where the melanoma IDs are:

    MONDO: MONDO:0005105; MeSH: D008545; MedGen: C0025202; Human Phenotype Ontology: HP:0002861

  • That's the same MedGen ID as in MyVariant...so maybe the original clinvar data file is also using this ID. But it's still confusing because that MedGen ID's page says C0025202 is the UMLS concept ID, vs the MedGen UID: 9944
  • The HP ID is different from the one in MyVariant: HP:0002861. This seems to be the proper HP ID for melanoma.
    • However, when I look up this ID in BioPortal, I see the ID in MyVariant, HP:0007474, as an alternative ID. But I don't know what an "alternative ID" means (perhaps that it's deprecated and shouldn't be used anymore?).
    • So...maybe the original clinvar data file uses this alternative ID? (Is it possible that the clinvar data file did change at some point to use the proper ID and MyVariant didn't recognize/incorporate the change?)


All the other "odd" HP IDs I saw in MyVariant's clinvar data are also alternative IDs

MyVariant is using:

I wonder if MyVariant can map these alternative IDs to their proper/main IDs, and use the proper/main IDs instead...
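A sketch of what that mapping step could look like. This is not MyVariant's parser: the lookup table only contains the one pair found in this comment, and a real table would have to be built from the ontology itself (that part is an assumption and isn't shown).

```python
# Hypothetical lookup table: alternative HP ID -> proper/main HP ID.
# Only the melanoma pair observed in this comment is included.
ALT_TO_MAIN_HP = {
    "HP:0007474": "HP:0002861",
}


def normalize_hp_id(hp_id, alt_to_main=ALT_TO_MAIN_HP):
    """Return the main HP ID if hp_id is a known alternative ID, else hp_id unchanged."""
    return alt_to_main.get(hp_id, hp_id)


def normalize_condition(condition, alt_to_main=ALT_TO_MAIN_HP):
    """Return a copy of a clinvar condition with its HP identifier normalized."""
    identifiers = dict(condition.get("identifiers", {}))
    if "human_phenotype_ontology" in identifiers:
        identifiers["human_phenotype_ontology"] = normalize_hp_id(
            identifiers["human_phenotype_ontology"], alt_to_main
        )
    return {**condition, "identifiers": identifiers}


melanoma = {
    "identifiers": {
        "human_phenotype_ontology": "HP:0007474",
        "medgen": "C0025202",
        "mesh": "D008545",
        "mondo": "MONDO:0005105",
    },
    "name": "Melanoma",
}
normalized = normalize_condition(melanoma)
```

IDs that aren't in the table pass through unchanged, so the step is safe to apply to every record.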

@colleenXu
Collaborator

colleenXu commented Jan 10, 2024

Regarding the "odd" orphanet IDs I saw in MyVariant's clinvar data...they probably come from the original clinvar data. But I wonder if it's possible to keep > 1 ID for a namespace, in cases where the clinvar data may provide multiple (see Example 1 where clinvar probably provides 2 IDs and one is correct).

Examples

Example 1: MyVariant is using orphanet:8378 for Autosomal recessive polycystic kidney disease (ARPKD).

Example 2: MyVariant is using orphanet:178330 for Heinz body anemia.

@colleenXu
Collaborator

colleenXu commented Jan 10, 2024

Finally, on the "odd" medgen IDs I saw in MyVariant's clinvar data...they probably come from the original clinvar data. BTE isn't using medgen namespaces because Translator doesn't seem to support it yet (biolink-model, node norm).

But I wonder:

  • do we want to keep the data where condition is "not provided" (medgen CN517202) in MyVariant or not?
  • did the clinvar data file change at some point to use an updated ID (like medgen CN517202 -> C3661900 for "not provided") and MyVariant didn't recognize/incorporate the change?
Example 1: medgen CN517202 for "not provided"

MyVariant is using medgen CN517202 for the condition "not provided" in >800k records. This MedGen ID seems to be outdated, replaced by C3661900.

Example 2: MyVariant is using medgen C0005283 for beta Thalassemia (BTHAL)

Found the same situation as above with the melanoma medgen ID.

The MyVariant record for rs1847557333 and beta Thalassemia (BTHAL) matches this RCV record, which is using the same MedGen ID. So maybe the original clinvar data file is also using this ID. But it's still confusing because that MedGen ID's page says C0005283 is the UMLS concept ID, vs the MedGen UID: 2611

@colleenXu
Collaborator

colleenXu commented Jan 10, 2024

So to summarize my ideas after the MyVariant clinvar disease ID analyses I did (above posts)...

I suspect the "odd" IDs are coming from the original clinvar ingest data.

But I wonder if there are parser changes that could help:

  • HP IDs: can we find these alternative IDs and map them to the proper/main IDs?
  • orphanet IDs: can we keep > 1 ID for a namespace, in cases where the clinvar data may provide multiple (see Example 1 where clinvar probably provides 2 IDs and one is correct)
  • medgen IDs:
    • do we want to keep the data where condition is "not provided" (medgen CN517202) in MyVariant or not?
    • did the clinvar data file change at some point to use an updated ID (like medgen CN517202 -> C3661900 for "not provided") and MyVariant didn't recognize/incorporate the change?

@colleenXu
Collaborator

colleenXu commented Jan 17, 2024

Closing this issue because the original problem has been addressed with mondo/mesh namespace coverage.


As for the "odd MyVariant clinvar disease IDs" (summary in previous post):

  • @andrewsu proposed that updating MyVariant with the latest clinvar dataset may help with these issues (CC @newgene @everaldorodrigo)
  • after a MyVariant update, I could redo this investigation. If I notice the same problems, then I could open an issue in MyVariant's repo for further investigation + my proposed parser changes.

(also see lab Slack convo here)
