BioThings suppKG: parser, x-bte, adding to BTE #706

Closed
colleenXu opened this issue Aug 18, 2023 · 10 comments
Labels: data source, On Test (related changes are deployed to Test server), x-bte

Comments

@colleenXu
Collaborator

colleenXu commented Aug 18, 2023

Opening an issue here to better track the status of this effort.

Previous discussion is in NCATS-Tangerine/translator-api-registry#122, with the currently relevant comments starting at NCATS-Tangerine/translator-api-registry#122 (comment) and biothings/pending.api#55 (comment).

There are currently some concerns related to the data/parser...

@colleenXu
Collaborator Author

Thanks to @mnarayan1, we have a SmartAPI yaml https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/suppkg/suppkg.yaml that covers supplement treatments for disease. We were able to use templated requestBody to generate a BioThings query structure that we haven't tried before: setting a field to multiple possible values using OR.
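For readers unfamiliar with the "multiple possible values using OR" pattern, here is a rough sketch of what such a clause looks like, assuming the usual Lucene-style query-string syntax. The helper name, the field, and the values are illustrative assumptions only; the real template lives in suppkg.yaml and may be structured differently.

```python
def build_or_query(field: str, values: list[str]) -> str:
    """Build a Lucene-style clause that matches any of the given values.

    Hypothetical helper for illustration; not taken from suppkg.yaml.
    """
    if not values:
        raise ValueError("need at least one value")
    # Quote each value so IDs containing special characters stay intact.
    joined = " OR ".join(f'"{v}"' for v in values)
    return f"{field}:({joined})"


# Illustrative field and values (assumptions, not the template's actual ones).
clause = build_or_query("subject.umls", ["C1268859", "C0016157"])
print(clause)  # subject.umls:("C1268859" OR "C0016157")
```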

I've registered the SmartAPI yaml https://smart-api.info/registry?q=b48c34df08d16311e3bca06b135b828d

So it's now accessible through any BTE instance using the api-specific endpoints - but it's not used by the team-specific / ara-specific endpoints yet.

@colleenXu
Collaborator Author

Here's a TRAPI query for "zinc supplement" -> disease
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["UMLS:C1268859"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

Response: suppKG1.txt

An Edge in the response looks like this in the ARAX UI:
[Screenshot: Screen Shot 2023-08-24 at 11 09 30 PM]

@colleenXu
Collaborator Author

colleenXu commented Aug 25, 2023

But... I still want to discuss the "UMLS:DC" IDs with @andrewsu (previous posts here and here) before moving forward.

I'm using an "ulcerative colitis" -> supplement response as my reference: suppkg2.txt

TRAPI query
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["UMLS:C0009324"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
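Both TRAPI queries in this thread share the same one-hop shape, differing only in the starting ID and the two categories. A small sketch of a builder that produces either one (the function name is my own, not part of BTE or TRAPI tooling):

```python
import json


def one_hop_query(subject_id: str, subject_category: str, object_category: str) -> dict:
    """Return a one-hop TRAPI query graph with a pinned n0 and an open n1."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [subject_id], "categories": [subject_category]},
                    "n1": {"categories": [object_category]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }


# "zinc supplement" -> disease
zinc_query = one_hop_query("UMLS:C1268859", "biolink:SmallMolecule", "biolink:Disease")
# "ulcerative colitis" -> supplement
uc_query = one_hop_query("UMLS:C0009324", "biolink:Disease", "biolink:SmallMolecule")

print(json.dumps(uc_query, indent=4))
```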

Analysis

The IDs may be real UMLS IDs, if you remove the "D".

| UMLS:DC ID | suppKG Name | Real UMLS ID | UMLS Name | Notes |
| --- | --- | --- | --- | --- |
| UMLS:DC0023791 | (r)-dithiolane-3-pentanoic acid | UMLS:C0023791 | thioctic acid | Mapped ID names ((R)-1,2-Dithiolane-3-pentanoic acid) are very similar to suppKG's name. This is more commonly known as alpha-lipoic acid |
| UMLS:DC0026370 | black strap molasses | UMLS:C0026370 | molasses | Hmm, blackstrap molasses is a narrower concept |
| UMLS:DC0016157 | 1,200 mg | UMLS:C0016157 | fish oils | |
| UMLS:DC0014839 | aesculin | UMLS:C0014839 | esculin | suppKG's name is in the mapped ID names |
| UMLS:DC0349374 | arerra | UMLS:C0349374 | Cow's milk | "arerra" is a synonym for fermented milk |
| UMLS:DC1141640 | beesnest plant | UMLS:C1141640 | Carrots - dietary | Hmm... Bee's nest-plant is also called wild carrot or Queen Anne's lace |
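The "remove the D" step used to fill in the "Real UMLS ID" column can be sketched as below. The function name is hypothetical, and the result is only a candidate to look up by hand: nothing here verifies that the resulting CUI exists in UMLS or means the same thing.

```python
import re

# Matches IDs shaped like "UMLS:DC0023791" and captures the digits.
_DC_CURIE = re.compile(r"^UMLS:DC(\d+)$")


def candidate_umls_curie(curie: str):
    """Return the CURIE with the 'D' dropped, or None if it is not a DC ID.

    This only rewrites the string; the mapping itself is unverified.
    """
    match = _DC_CURIE.match(curie)
    return f"UMLS:C{match.group(1)}" if match else None


for dc_id in ["UMLS:DC0023791", "UMLS:DC0016157", "UMLS:C0009324"]:
    print(dc_id, "->", candidate_umls_curie(dc_id))
```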

The UMLS ID names may match suppKG's associations

The edge for "1,200 mg" (UMLS:DC0016157) is actually about fish oils (UMLS:C0016157), and doesn't mention "1,200 mg".

                "5a6f30fdb2b0d8703c8d4bc8ff58ef96": {
                    "predicate": "biolink:treated_by",
                    "subject": "MONDO:0005101",
                    "object": "UMLS:DC0016157",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:30489199"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        },
                        {
                            "attribute_type_id": "biolink:supporting_text",
                            "value": [
                                "Therefore, realizing the need for safer and well tolerable alterative treatment approaches, currently, we evaluated the efficacy of n-3 fatty acids rich fish oil (FO) in the resolution of UC."
                            ]
                        }
                    ],

The edge for "fibersol-2" (UMLS:DC0032594) is actually about polysaccharides (UMLS:C0032594), not fibersol-2.

fibersol-2 is a branded supplement containing fiber and maltodextrin, derived from corn.

But the edge is actually about two different kinds of polysaccharides:

                "3b58d54615751c2a11c4f28660371a6a": {
                    "predicate": "biolink:treated_by",
                    "subject": "MONDO:0005101",
                    "object": "UMLS:DC0032594",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:30572047",
                                "PMID:23674951"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        },
                        {
                            "attribute_type_id": "biolink:supporting_text",
                            "value": [
                                "Efficacy of co-administration of modified apple polysaccharide and probiotics in guar gum-Eudragit S100 based mesalamine mini tablets: A novel approach in treating ulcerative colitis.",
                                "Our results showed that RTP had significant therapeutic effects on both UC and CD."
                            ]
                        }
                    ],

Other analysis: it seems okay to use the UMLS ID/name, but other things are going on

The edge for "arerra" (UMLS:DC0349374) actually mentions cow milk (UMLS:C0349374), but it turns out "arerra" is an obscure name for the supplement.

"arerra" is a synonym for fermented milk.

                "7d60ab8033b02610a8209dfd5926be57": {
                    "predicate": "biolink:treated_by",
                    "subject": "MONDO:0005101",
                    "object": "UMLS:DC0349374",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:21525768"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        },
                        {
                            "attribute_type_id": "biolink:supporting_text",
                            "value": [
                                "Here, we examined the effects of a live Bifidobacterium breve strain Yakult, a probiotic contained in bifidobacteria-fermented milk, and galacto-oligosaccharide (GOS) as synbiotics in UC patients."
                            ]
                        }
                    ],

Neither the suppKG name nor the real UMLS name matches the paper: entity-resolution issue?

The edge for "beesnest plant" (UMLS:DC1141640) isn't about bee's nest-plant/wild carrot/Queen Anne's lace. It also isn't about the food carrots (Carrots - dietary; UMLS:C1141640).

The paper is about Morinda officinalis, aka Indian mulberry.

                "3dc0b3a041254bb526b3d75907063109": {
                    "predicate": "biolink:treated_by",
                    "subject": "MONDO:0005101",
                    "object": "UMLS:DC1141640",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:28824631"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        },
                        {
                            "attribute_type_id": "biolink:supporting_text",
                            "value": [
                                "The results demonstrated that the effects of MORE and MOHRE for the treatment of UC are similar, although there are a few difference on their chemical composition, indicating the hairy root cultured from <i>M."
                            ]
                        }
                    ],
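The manual checks above all follow the same routine: take an edge, pull out its publications and supporting text, and compare them with the node names. A minimal sketch of that extraction step, assuming edges shaped like the excerpts in this comment (the function name is my own):

```python
def summarize_edge(edge: dict) -> dict:
    """Collect the PMIDs and supporting sentences attached to one edge."""
    publications, sentences = [], []
    for attr in edge.get("attributes", []):
        type_id = attr.get("attribute_type_id")
        if type_id == "biolink:publications":
            publications.extend(attr.get("value", []))
        elif type_id == "biolink:supporting_text":
            sentences.extend(attr.get("value", []))
    return {
        "subject": edge.get("subject"),
        "object": edge.get("object"),
        "publications": publications,
        "supporting_text": sentences,
    }


# Edge copied from the fish-oil ("1,200 mg") example above.
fish_oil_edge = {
    "predicate": "biolink:treated_by",
    "subject": "MONDO:0005101",
    "object": "UMLS:DC0016157",
    "attributes": [
        {
            "attribute_type_id": "biolink:publications",
            "value": ["PMID:30489199"],
            "value_type_id": "linkml:Uriorcurie",
        },
        {
            "attribute_type_id": "biolink:supporting_text",
            "value": [
                "Therefore, realizing the need for safer and well tolerable alterative treatment approaches, currently, we evaluated the efficacy of n-3 fatty acids rich fish oil (FO) in the resolution of UC."
            ],
        },
    ],
}

summary = summarize_edge(fish_oil_edge)
print(summary["object"], summary["publications"])
```

Printing the supporting sentence next to the node name makes mismatches like "1,200 mg" versus fish oil easy to spot.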

@colleenXu
Collaborator Author

Note that "moving forward" steps would be:

  • getting infores entries w/ wiki pages for infores:suppkg (primary) and infores:biothings-suppkg (aggregator)
  • making a PR to add to BTE's config file, deploying to dev and CI (not frozen)

@andrewsu
Member

Per @erikyao 's comment here:

Hi @colleenXu , from SemRep_DS/docs/SemRep_full_fielded_output.txt:

*_CUI: The CUI of the subject/object entity. If a CUI starts with
'DC' instead of just 'C' it is an iDISK CUI and is not present in the UMLS.

It seems like the authors' intent is clear that "DC" IDs are meant to represent concepts for which they find no synonymous UMLS ID. @colleenXu, you've found many examples where it appears that there is a very tight connection between the "DC" ID and the corresponding UMLS ID. However, I don't think we have the time or expertise to be able to evaluate that linking exhaustively. Since the consequence of moving forward as-is is underlinking (rather than inclusion of false assertions, at least beyond the expected rate from a text-mined resource), I think we should go forward with that plan. So please proceed with the next steps you outlined in the preceding comment. Thanks!

@colleenXu
Collaborator Author

colleenXu commented Sep 12, 2023

After discussion with Andrew (8/29?), we agreed to go forward with the DC IDs.

I followed my earlier post of "next steps to deployment":

  • infores catalog updates to add infores:biothings-suppkg and infores:suppkg are included in Cx infores edits biolink/biolink-model#1391 (oops I did extra). I created the infores wiki pages, and filled out the SuppKG one (primary source, so UI would use it)
  • PR to add BioThings SuppKG to config: "feat: add biothings suppkg to config" #724
    • EDIT: however, I've asked for a pause of deploying this (to dev/CI for now). One reason is the new thoughts I have on "DC" terms (see next comment).

@colleenXu
Collaborator Author

colleenXu commented Sep 12, 2023

@andrewsu @erikyao

I have another thought on the "DC" terms, but I don't know if @erikyao already investigated this...

Based on Yao's url https://github.com/zhang-informatics/SemRep_DS/blob/main/docs/SemRep_full_fielded_output.txt:

  • SuppKG maybe didn't originally have a prefix for these IDs.
  • The text says "If a CUI starts with 'DC' instead of just 'C' it is an iDISK CUI and is not present in the UMLS."

So I wonder if we'd want these "DC" terms in different fields of the BioThings SuppKG API. Right now, they're in subject.umls and object.umls, which is why the x-bte annotation sets BTE up to add the UMLS prefix to these "DC" terms, even though they're not UMLS CUIs...
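To make the concern concrete, here is a sketch contrasting the current behaviour with a namespace-aware alternative. Both functions are hypothetical illustrations rather than BTE code, and the "IDISK" prefix is a placeholder I made up: no such prefix has been agreed on or registered in this thread.

```python
def prefix_as_umls(raw_id: str) -> str:
    """Mimic the current outcome: every ID in the umls fields gets 'UMLS:'."""
    return f"UMLS:{raw_id}"


def prefix_by_namespace(raw_id: str, idisk_prefix: str = "IDISK") -> str:
    """Hypothetical alternative: route DC IDs to a separate namespace.

    'IDISK' is a placeholder prefix, an assumption for this sketch only.
    """
    if raw_id.startswith("DC"):
        return f"{idisk_prefix}:{raw_id}"
    if raw_id.startswith("C"):
        return f"UMLS:{raw_id}"
    raise ValueError(f"unrecognized ID shape: {raw_id!r}")


for raw in ["C0016157", "DC0016157"]:
    print(raw, "|", prefix_as_umls(raw), "|", prefix_by_namespace(raw))
```

The second function needs the DC check first, since a check for a leading "C" alone would never see those IDs anyway but the ordering keeps the intent obvious.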


And I was wondering if we know more about the "DC" terms, which may help us decide if they are a different namespace (and if so, what the prefix and other namespace info would be).

  • In a quick look in the suppKG paper, I see "Additionally, because of how DCUIs were assigned in iDISK, it is possible to map DS concepts with DCUIs to UMLS concepts with CUIs." This makes me wonder if (and how many) mappings exist between "DCUIs" and "UMLS CUIs"...and whether this could be added to the BioThings SuppKG API...
  • To understand these "DC" IDs more... this may involve digging through the SuppKG paper and maybe the iDISK paper it references

@colleenXu colleenXu self-assigned this Sep 13, 2023
@andrewsu
Member

After reviewing this again, I think we should move forward with the "quickest path" solution -- keeping the DC IDs under subject.umls and object.umls. Yes, it results in invalid UMLS curies, but I think that's fine for the sake of expediency.

Also just noting for future reference that in the source file, there are 53707 IDs that start with C, and 2928 that start with D.
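For context on how much of the data the "DC" question affects, here is a small sketch that tallies IDs by first letter and works out the share from the two counts reported above. The sample list is made up for illustration; only the 53707 and 2928 figures come from this thread.

```python
from collections import Counter


def count_by_first_letter(ids):
    """Tally IDs by their first character (e.g. 'C' vs 'D')."""
    return Counter(i[0] for i in ids if i)


# Tiny made-up sample, just to show the tally working.
sample = ["C0016157", "DC0016157", "C0009324", "DC0349374", "C1268859"]
print(count_by_first_letter(sample))

# Counts reported in the comment above.
c_count, d_count = 53707, 2928
d_share = d_count / (c_count + d_count)
print(f"{d_share:.1%} of IDs start with D")  # 5.2% of IDs start with D
```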

@colleenXu colleenXu added the "On Dev" and "On CI" labels and removed the "needs discussion", "bug", and "On Dev" labels Oct 20, 2023
@colleenXu
Collaborator Author

Now being addressed by a different commit biothings/bte-server@58177d3. This is now deployed on dev/CI instances.

See Jackson's post here

@tokebe tokebe added the "On Test" label and removed the "On CI" label Dec 20, 2023
@colleenXu
Collaborator Author

Closing this issue since the changes have been deployed to Prod with the Feb 2024 release.

I've confirmed that I can query BioThings suppKG through BTE prod https://bte.transltr.io/v1/team/Service Provider/query with the example in #706 (comment) and get the expected response.
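For anyone wanting to repeat this check, here is a rough sketch using only the Python standard library. It assumes the zinc query from earlier in the thread is the example meant (the comment link doesn't say which one), and the function names are my own. Note that the space in "Service Provider" has to be percent-encoded before the request is sent.

```python
import json
import urllib.parse
import urllib.request

BTE_PROD_URL = "https://bte.transltr.io/v1/team/Service Provider/query"

# Assumed example: the "zinc supplement" -> disease query from this thread.
ZINC_QUERY = {
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {"ids": ["UMLS:C1268859"], "categories": ["biolink:SmallMolecule"]},
                "n1": {"categories": ["biolink:Disease"]},
            },
            "edges": {"e01": {"subject": "n0", "object": "n1"}},
        }
    }
}


def build_trapi_request(url: str, query: dict) -> urllib.request.Request:
    """Prepare a JSON POST request; the space in the URL becomes %20."""
    safe_url = urllib.parse.quote(url, safe=":/")
    return urllib.request.Request(
        safe_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: dict, timeout: float = 300.0) -> dict:
    """Send the query to BTE prod and return the parsed JSON response."""
    request = build_trapi_request(BTE_PROD_URL, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_trapi_request(BTE_PROD_URL, ZINC_QUERY)
print(request.get_method(), request.full_url)
```

Calling `run_query(ZINC_QUERY)` performs the actual network request; it is left uncalled here so the snippet runs offline.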
