creative-mode "Explain-style" prototype for Pathfinder option (Translator Jan 2024 Relay) #771

Closed
colleenXu opened this issue Dec 28, 2023 · 8 comments
Labels
enhancement New feature or request needs discussion

colleenXu commented Dec 28, 2023

For the Jan 2024 Relay, Translator teams are supposed to bring prototypes for the next-creative-mode choices: Pathfinder and multi-curie. Our team has chosen to focus on Pathfinder.

This issue is for discussing / working on this prototype.


Current assumptions:

  • only 1 ID on each QNode, aka "finding paths between only two entities"
  • We can implement this by having BTE run "Explain queries": first with 1 QEdge (are the two entities directly connected?), then with 2 QEdges (1 intermediate node), etc.
  • supposed to be "on Dev" by Relay, but we may be fine if we don't get this done: we could instead bring a set of template queries, plus plans for how we would implement it and how difficult/easy it would be
  • if need be, we could narrow scope (categories for starting nodes or intermediate nodes, predicates?)
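
A minimal sketch of the "first 1 QEdge, then 2 QEdges" idea. `build_explain_query` and `explain_query_plan` are hypothetical helpers (not part of BTE) that only build TRAPI query graphs. The direction of the last edge (end entity pointing at the last intermediate) is an assumption that mirrors the 1-intermediate query graphs tested later in this issue.

```python
def build_explain_query(start_id, end_id, n_intermediates=0, predicates=None):
    """Build an Explain-style TRAPI query with n_intermediates open QNodes."""
    if n_intermediates < 0:
        raise ValueError("n_intermediates must be >= 0")
    last = n_intermediates + 1
    nodes = {"n0": {"ids": [start_id]}}
    for i in range(1, last):
        # Intermediates are left as general as possible.
        nodes[f"n{i}"] = {"categories": ["biolink:NamedThing"]}
    nodes[f"n{last}"] = {"ids": [end_id]}

    edges = {}
    for i in range(last):
        subject, obj = f"n{i}", f"n{i + 1}"
        # Assumption: when there are intermediates, the final edge points from
        # the end entity back to the last intermediate (start -> mid <- end).
        if n_intermediates > 0 and i == last - 1:
            subject, obj = obj, subject
        edge = {"subject": subject, "object": obj}
        if predicates:
            edge["predicates"] = list(predicates)
        edges[f"e{i}"] = edge
    return {"message": {"query_graph": {"nodes": nodes, "edges": edges}}}


def explain_query_plan(start_id, end_id, max_intermediates=1):
    """Queries to run in order: 0 intermediates first, then 1, and so on."""
    return [
        build_explain_query(start_id, end_id, k)
        for k in range(max_intermediates + 1)
    ]
```

For example, `explain_query_plan("PUBCHEM.COMPOUND:5291", "MONDO:0004979", 1)` gives the direct query followed by the 1-intermediate query.
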
@colleenXu colleenXu added enhancement New feature or request needs discussion labels Dec 28, 2023

colleenXu commented Dec 29, 2023

Running Explain-type queries for the imatinib use cases

Summary:

  • It takes > 5 min to run the Explain query with 1 intermediate (about 7 min for each use case below, run locally w/o threading or caching)
  • BTE has the expected gene results (highly-ranked too)
  • but BTE lacks the explanatory intermediates like "mast cells", "immune cell activation", and "cell cycle". I think there are two causes: (1) there isn't much association data for biolink:Cell (cell types), and (2) the association data for biolink:BiologicalProcessOrActivity (which includes MolecularActivity, PhysiologicalProcess, PathologicalProcess, Pathway) mostly connects to Genes. Very little of it connects to other categories, like the Disease / Chemical starting nodes in these use cases.

imatinib ➡️ asthma

Should go through c-kit, mast cells, and immune cell activation.

I got these IDs from SRI Name Resolver: PUBCHEM.COMPOUND:5291 for imatinib, MONDO:0004979 for asthma.

There are direct edges for imatinib ➡️ asthma

First, I ran an Explain-query w/o any intermediates (1 QEdge connecting them):

  • it runs quickly, only 14 s
  • 4 Edges found:
    • treats: text-mining targeted
    • associated_with: multiomics ehr risk. with qualifiers, the statement is "imatinib is associated with decreased likelihood of asthma"
    • has_adverse_event: from automat drugcentral (faers) and mychem drugcentral (likely the same original data)

full response: imatinib-direct-asthma.json

click to see query-graph

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:5291"],
                    "name": "imatinib"
                },
                "n1": {
                    "ids":["MONDO:0004979"],
                    "name": "asthma"
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

imatinib ➡️ 1 intermediate ⬅️ asthma

Then, I ran an Explain-query w/ 1 intermediate QNode. full response here: imatinib-inter-asthma-4.json

click to see query-graph

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:5291"],
                    "name": "imatinib"
                },
                "n1": {
                    "categories":["biolink:NamedThing"]
                },
                "n2": {
                    "ids":["MONDO:0004979"],
                    "categories":["biolink:DiseaseOrPhenotypicFeature"],
                    "name": "asthma"
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:related_to_at_instance_level",
                        "biolink:disease_has_location", "biolink:location_of_disease",
                        "biolink:composed_primarily_of", "biolink:primarily_composed_of",
                        "biolink:has_chemical_role",
                        "biolink:has_member", "biolink:member_of"
                    ]
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1",
                    "predicates": [
                        "biolink:related_to_at_instance_level",
                        "biolink:disease_has_location", "biolink:location_of_disease",
                        "biolink:composed_primarily_of", "biolink:primarily_composed_of",
                        "biolink:has_chemical_role",
                        "biolink:has_member", "biolink:member_of"
                    ]
                }
            }
        }
    }
}

Overall

  • it took 7 min to run locally (w/o threading or caching)
  • 718 results
  • Big Caveat: I also set the QNode with the asthma ID to the category DiseaseOrPhenotypicFeature. If I remove this category so that BTE uses only the top category from SRI NodeNorm, then KIT is no longer in the results (see this response: imatinib-inter-asthma-3.json).
    • I suspect that the asthma - KIT edge from biolink-api/monarch is retrieved by the PhenotypicFeature (HP ID) -> Gene operations, not the Disease (MONDO ID) -> Gene operations.
  • Note: I set an explicit predicate list on both QEdges to leave out predicates that didn't seem useful or that may create unhelpful self-edges. The excluded predicates are: superclass_of, subclass_of, broad_match / narrow_match, close_match / exact_match / same_as
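
One way to check that the predicate list does its job is to scan a saved response for self-edges and for edges that still use an excluded predicate. This is a rough sketch: `audit_edges` is a hypothetical helper, and it assumes the usual TRAPI layout where `message.knowledge_graph.edges` maps edge IDs to objects with `subject`, `object`, and `predicate`.

```python
EXCLUDED_PREDICATES = {
    "biolink:superclass_of", "biolink:subclass_of",
    "biolink:broad_match", "biolink:narrow_match",
    "biolink:close_match", "biolink:exact_match", "biolink:same_as",
}


def audit_edges(response, excluded=EXCLUDED_PREDICATES):
    """Return (self_edge_ids, excluded_predicate_edge_ids) for a TRAPI response."""
    kg = response.get("message", {}).get("knowledge_graph") or {}
    self_edges, unwanted = [], []
    for edge_id, edge in (kg.get("edges") or {}).items():
        # A self-edge connects an entity to itself.
        if edge.get("subject") == edge.get("object"):
            self_edges.append(edge_id)
        if edge.get("predicate") in excluded:
            unwanted.append(edge_id)
    return sorted(self_edges), sorted(unwanted)
```

Both lists should come back empty for a response produced with the predicate list above, if the list works as intended.
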

Expected results

  • KIT (Gene) is the top result! Note: only 1 edge for asthma ➡️ KIT from biolink-api (monarch) VS lots of imatinib ➡️ KIT edges from multiple sources
  • "mast cells": no exact match. There were some intermediate diseases that seemed related:
    • systemic mastocytosis - rank 21
    • mastocytosis - rank 34
    • aggressive systemic mastocytosis - rank 674
  • "immune cell activation": no exact match (I was looking for a related process / activity / pathway). But there are disease and gene intermediates that seem to be related to the immune system and to cell proliferation.
    • Genes PDGFRA (rank 2) and PDGFB (rank 55) seem related to asthma, immune regulation, and imatinib
    • immune system disorder intermediates like idiopathic hypereosinophilic syndrome (rank 33), eosinophilic pneumonia (rank 138)
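
For reference, the rank lookups above can be scripted against a saved response. A minimal sketch, assuming the standard TRAPI layout (`message.results[].node_bindings` and `message.knowledge_graph.nodes[].name`) and assuming the results list is already sorted by score, so that list position equals rank. `find_intermediates` is a hypothetical helper name.

```python
def find_intermediates(response, qnode_key, name_fragment):
    """Return (rank, curie, name) for results whose node bound to qnode_key
    has a name containing name_fragment (case-insensitive)."""
    message = response["message"]
    kg_nodes = (message.get("knowledge_graph") or {}).get("nodes") or {}
    fragment = name_fragment.lower()
    hits = []
    for rank, result in enumerate(message.get("results") or [], start=1):
        for binding in result.get("node_bindings", {}).get(qnode_key, []):
            # Some nodes have no name; treat that as an empty string.
            name = kg_nodes.get(binding["id"], {}).get("name") or ""
            if fragment in name.lower():
                hits.append((rank, binding["id"], name))
    return hits
```

With the 1-intermediate query graph above, the intermediate QNode key is `n1`, so a call like `find_intermediates(response, "n1", "mastocytosis")` would list the mastocytosis-related intermediates with their ranks.
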

intermediate node categories analysis

I searched the response using the console logs for the intermediate node categories, and got this list:

  • Disease, PhenotypicFeature
  • Gene
  • Chem: SmallMolecule, ChemicalEntity, MolecularMixture
  • 1 PhysiologicalProcess, pregnancy (imatinib contraindicated_for pregnancy and asthma correlated_with pregnancy)
  • 1 Procedure, liver transplantation (edges from multiomics ehr risk)

Interestingly, Pathway didn't show up in either hop. The other BiologicalProcessOrActivity categories (BiologicalProcess, MolecularActivity, PhysiologicalProcess, PathologicalProcess) each showed up in at least one hop before the intermediate nodes were intersected.
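
The category lists from the three console logs quoted below can be compared with plain set operations. This is only a sanity check on the logged lists; it says nothing about how BTE intersects entities internally.

```python
imatinib_hop = {
    "SmallMolecule", "PhysiologicalProcess", "Disease", "PhenotypicFeature",
    "Gene", "Protein", "PathologicalProcess", "Procedure",
    "ChemicalEntity", "OrganismTaxon", "Polypeptide", "Cell",
    "Phenomenon", "Drug", "MolecularActivity", "DiseaseOrPhenotypicFeature",
    "CellularComponent", "GrossAnatomicalStructure", "AnatomicalEntity", "Plant",
    "MolecularMixture", "ComplexMolecularMixture",
}
asthma_hop = {
    "Disease", "PhenotypicFeature", "Device", "ClinicalIntervention",
    "Procedure", "Event", "ClinicalAttribute", "PhysiologicalProcess",
    "BiologicalProcess", "Activity", "ComplexMolecularMixture", "Gene",
    "Protein", "ChemicalExposure", "SmallMolecule", "Drug",
    "Publication", "InformationContentEntity", "PopulationOfIndividualOrganisms",
    "EnvironmentalExposure", "MolecularMixture", "ChemicalEntity",
    "SequenceVariant", "OrganismAttribute",
}
after_intersection = {
    "Disease", "PhenotypicFeature", "PhysiologicalProcess", "BiologicalProcess",
    "SmallMolecule", "ChemicalExposure", "Drug", "Gene",
    "ChemicalEntity", "Procedure", "MolecularMixture",
}

either_hop = imatinib_hop | asthma_hop
shared = imatinib_hop & asthma_hop

print("Pathway" in either_hop)              # False
print(sorted(after_intersection - shared))  # ['BiologicalProcess', 'ChemicalExposure']
print(sorted(shared - after_intersection))  # ['ComplexMolecularMixture', 'Protein']
```

The last two lines show that the final logged list is not simply the set intersection of the two per-hop category lists. One possible reading (an assumption, not something the logs confirm) is that the intersection happens on entities, and the same entity can be logged under a different category in each hop.
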

Console logs for intermediate node categories

After getting the intersection of intermediate nodes, the final console log for the categories was:

bte:biothings-explorer-trapi:QEdge Collected entity ids in records: 
["Disease","PhenotypicFeature","PhysiologicalProcess","BiologicalProcess",
"SmallMolecule","ChemicalExposure","Drug","Gene",
"ChemicalEntity","Procedure","MolecularMixture"] +76ms

But before then, during the imatinib hop, the categories were:

  bte:biothings-explorer-trapi:QEdge Collected entity ids in records: 
["SmallMolecule","PhysiologicalProcess","Disease","PhenotypicFeature",
"Gene","Protein","PathologicalProcess","Procedure",
"ChemicalEntity","OrganismTaxon","Polypeptide","Cell",
"Phenomenon","Drug","MolecularActivity","DiseaseOrPhenotypicFeature",
"CellularComponent","GrossAnatomicalStructure","AnatomicalEntity","Plant",
"MolecularMixture","ComplexMolecularMixture"] +58ms

And during the asthma hop (before the intersecting began), the categories were:

  bte:biothings-explorer-trapi:QEdge Collected entity ids in records: 
["Disease","PhenotypicFeature","Device","ClinicalIntervention",
"Procedure","Event","ClinicalAttribute","PhysiologicalProcess",
"BiologicalProcess","Activity","ComplexMolecularMixture","Gene",
"Protein","ChemicalExposure","SmallMolecule","Drug",
"Publication","InformationContentEntity","PopulationOfIndividualOrganisms","EnvironmentalExposure",
"MolecularMixture","ChemicalEntity","SequenceVariant","OrganismAttribute"] +365ms

imatinib ➡️ CML (Chronic myelogenous leukemia)

Should go through BCR-ABL and cell cycle

I got these IDs from SRI Name Resolver: PUBCHEM.COMPOUND:5291 for imatinib, MONDO:0011996 for CML.

There are direct edges for imatinib ➡️ CML and its descendants

First, I ran an Explain-query w/o any intermediates (1 QEdge connecting them):

  • it runs quickly, only 13 s
  • lots of direct edges, including "treats"
  • also some edges to descendants

full response: imatinib-direct-cml.json

click to see query-graph

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:5291"],
                    "name": "imatinib"
                },
                "n1": {
                    "ids":["MONDO:0011996"],
                    "name": "cml"
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

imatinib ➡️ 1 intermediate ⬅️ CML

Then, I ran an Explain-query w/ 1 intermediate QNode. full response here: imatinib-inter-cml-2.json

click to see query-graph

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:5291"],
                    "name": "imatinib"
                },
                "n1": {
                    "categories":["biolink:NamedThing"]
                },
                "n2": {
                    "ids":["MONDO:0011996"],
                    "name": "cml"
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:related_to_at_instance_level",
                        "biolink:disease_has_location", "biolink:location_of_disease",
                        "biolink:composed_primarily_of", "biolink:primarily_composed_of",
                        "biolink:has_chemical_role",
                        "biolink:has_member", "biolink:member_of"
                    ]
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1",
                    "predicates": [
                        "biolink:related_to_at_instance_level",
                        "biolink:disease_has_location", "biolink:location_of_disease",
                        "biolink:composed_primarily_of", "biolink:primarily_composed_of",
                        "biolink:has_chemical_role",
                        "biolink:has_member", "biolink:member_of"
                    ]
                }
            }
        }
    }
}

Overall

  • it took 6 min 46 s to run locally (w/o threading or caching)
  • 1661 results
  • Note: I set an explicit predicate list on both QEdges to leave out predicates that didn't seem useful or that may create unhelpful self-edges. The excluded predicates are: superclass_of, subclass_of, broad_match / narrow_match, close_match / exact_match / same_as

Expected results

  • BCR-ABL: separately, BCR and ABL1 are the top gene results (rank 2 and 3), with lots of edges to imatinib and to CML.
    • There are also related results like "Fusion Proteins, bcr-abl" (rank 770, umls id, from biothings semmeddb) and "Tyrosine-protein kinase ABL1 (ABL)" (not in top 1000, TTD.TARGET id, from biothings ttd)
  • "cell cycle": no exact match.
    • There may be Gene intermediates related to the cell cycle, like CCND1
    • There were some PhysiologicalProcess intermediates that seem related but they're general concepts, from biothings semmeddb only, and didn't score highly (not in top 1000 results unless otherwise noted):
      • cell proliferation (rank 992)
      • autophagy (rank 994)
      • apoptosis (rank 998)
      • growth
      • cell growth
      • cell survival
      • signal transduction
      • cell death

intermediate node categories analysis

I searched the response using the console logs for the intermediate node categories, and got this list:

  • Disease, PhenotypicFeature
  • Gene, Protein
  • Chem: SmallMolecule, Drug, MolecularMixture, ChemicalEntity
  • PhysiologicalProcess
  • Procedure
  • Cell (but entities and edges aren't helpful or interesting. Ex: both imatinib and CML are located in "bone marrow cells", edges from biothings semmeddb)
  • 1 AnatomicalEntity: blood (both imatinib and CML are located in the blood, edges from biothings semmeddb)
  • 1 MolecularActivity: "Down-Regulation" but the edges weren't helpful - imatinib causes down-regulation and CML includes down-regulation (edges from biothings semmeddb)
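
As an alternative to reading the console logs, the intermediate categories could be tallied straight from a saved response. A minimal sketch, assuming the standard TRAPI layout (`node_bindings` in each result, `categories` on each knowledge-graph node) and that the intermediate QNode key is `n1` as in the query graph above. `tally_intermediate_categories` is a hypothetical helper.

```python
from collections import Counter


def tally_intermediate_categories(response, qnode_key="n1"):
    """Count how many distinct intermediate nodes carry each biolink category."""
    message = response["message"]
    kg_nodes = (message.get("knowledge_graph") or {}).get("nodes") or {}
    # Collect each intermediate once, even if it appears in several results.
    bound_ids = {
        binding["id"]
        for result in message.get("results") or []
        for binding in result.get("node_bindings", {}).get(qnode_key, [])
    }
    counts = Counter()
    for node_id in bound_ids:
        for category in kg_nodes.get(node_id, {}).get("categories") or []:
            # Drop the "biolink:" prefix for readability.
            counts[category.split(":", 1)[-1]] += 1
    return counts
```

A node with several categories is counted once under each of them, so the totals can add up to more than the number of intermediates.
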

Console logs for intermediate node categories

After getting the intersection of intermediate nodes, the final console log for the categories was:

  bte:biothings-explorer-trapi:QEdge Collected entity ids in records: 
["Disease","Gene","SmallMolecule","Drug",
"MolecularMixture", "PhenotypicFeature","Procedure","ChemicalEntity",
"Polypeptide","PhysiologicalProcess","Protein","PathologicalProcess",
"AnatomicalEntity","GrossAnatomicalStructure","Cell","MolecularActivity"] +37ms

But before then, during the imatinib hop, the categories were (basically the same as for the imatinib-asthma testing)

  bte:biothings-explorer-trapi:QEdge Collected entity ids in records: 
["SmallMolecule","PhenotypicFeature","PhysiologicalProcess","Disease",
"Gene","Protein","PathologicalProcess","Procedure",
"ChemicalEntity","OrganismTaxon","Polypeptide","Cell",
"Phenomenon","Drug","MolecularActivity","DiseaseOrPhenotypicFeature",
"CellularComponent","GrossAnatomicalStructure","AnatomicalEntity","Plant",
"MolecularMixture","ComplexMolecularMixture"] +94ms

And during the CML hop (before the intersecting began), the categories were:

  bte:biothings-explorer-trapi:QEdge Collected entity ids in records: ["Disease","Gene","Procedure","SmallMolecule",
"Drug","ChemicalEntity","MolecularMixture","Polypeptide",
"PhenotypicFeature","SequenceVariant","Protein","Pathway",
"CellularComponent","NucleicAcidEntity","PhysiologicalProcess","PathologicalProcess",
"BiologicalEntity","Cohort","OrganismTaxon","Virus",
"Device","GrossAnatomicalStructure","AnatomicalEntity","Cell",
"PopulationOfIndividualOrganisms","MolecularActivity","ComplexMolecularMixture"] +103ms

colleenXu commented Jan 3, 2024

Basic implementation ideas

EDIT: after discussion with Andrew 1/8.

1. I think the creative-mode query would look like the one below: the QNodes aren't given any biolink category, and the QEdge predicate is set to "related_to".

I think having no QEdge predicate would also make sense, but our current creative-mode won't run when I don't specify a predicate: I get 0 results and the warning log bte:biothings-explorer-trapi:inferred-mode Inferred Mode edge must specify a predicate. Your query terminates. +0ms.

This is for imatinib ➡️ CML (Chronic myelogenous leukemia)

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:5291"]
                },
                "n1": {
                    "ids":["MONDO:0011996"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"],
                    "knowledge_type": "inferred"
                }
            }
        }
    }
}
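
A rough sketch of a pre-check for this query shape, useful for catching the "must specify a predicate" case before sending anything. `pathfinder_precheck` is a hypothetical helper, and the rules it encodes (2 QNodes with exactly 1 ID each, 1 inferred QEdge with a predicate) are taken from the assumptions in this issue, not from BTE's own validation code.

```python
def pathfinder_precheck(query):
    """Return a list of problems; an empty list means the query looks Pathfinder-ready."""
    qg = query["message"]["query_graph"]
    nodes, edges = qg.get("nodes", {}), qg.get("edges", {})
    problems = []
    if len(nodes) != 2 or len(edges) != 1:
        problems.append("expected exactly 2 QNodes and 1 QEdge")
    for key, node in nodes.items():
        if len(node.get("ids") or []) != 1:
            problems.append(f"QNode {key} must have exactly 1 ID")
    for key, edge in edges.items():
        if edge.get("knowledge_type") != "inferred":
            problems.append(f"QEdge {key} must have knowledge_type 'inferred'")
        # Current creative mode stops when the inferred edge has no predicate.
        if not edge.get("predicates"):
            problems.append(f"QEdge {key} needs at least one predicate")
    return problems
```

The imatinib / CML query above passes with no problems; dropping `predicates` from `e0` yields one problem.
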

  2. BTE could look up the starting IDs with NodeNorm and retrieve the QNode categories early, before picking the matching templateGroups. Does BTE currently do this (I think it does)?
  3. BTE should then match the query to the Pathfinder templateGroups. All templateGroups will at least have the Explain-query w/ 0 intermediates (to see if the starting IDs or their descendants are directly connected).

click to see generic templates for 0 and 1 intermediates

Notes:

  • The templates don't set categories for the creativeQuerySubject / creativeQueryObject because I think it'd be best if BTE plugged in the categories from the NodeNorm ID-lookup. Right now, BTE doesn't seem to be doing this and raises an error (also noted in the next "Issues" section).
  • I set an explicit predicate list on the QEdges to leave out predicates that didn't seem useful or that may create unhelpful self-edges. The excluded predicates are: superclass_of, subclass_of, broad_match / narrow_match, close_match / exact_match / same_as
  • I'm assuming that the intermediates are what matters, so it's better for the 1-intermediate template to be startingID1 ➡️ intermediate ⬅️ startingID2 (rather than a one-direction path from startingID1 ➡️ intermediate ➡️ startingID2)

First template: 0 intermediates

{
    "message": {
        "query_graph": {
            "nodes": {
                "creativeQuerySubject": {
                },
                "creativeQueryObject": {
                }
            },
            "edges": {
                "eA": {
                    "subject": "creativeQuerySubject",
                    "object": "creativeQueryObject",
                    "predicates": [
                        "biolink:related_to_at_instance_level",
                        "biolink:disease_has_location", "biolink:location_of_disease",
                        "biolink:composed_primarily_of", "biolink:primarily_composed_of",
                        "biolink:has_chemical_role",
                        "biolink:has_member", "biolink:member_of"
                    ]
                }
            }
        }
    }
}

Second template: 1 intermediate

{
    "message": {
        "query_graph": {
            "nodes": {
                "creativeQuerySubject": {
                },
                "nA": {
                    "categories":["biolink:NamedThing"]
                },
                "creativeQueryObject": {
                }
            },
            "edges": {
                "eA": {
                    "subject": "creativeQuerySubject",
                    "object": "nA",
                    "predicates": [
                        "biolink:related_to_at_instance_level",
                        "biolink:disease_has_location", "biolink:location_of_disease",
                        "biolink:composed_primarily_of", "biolink:primarily_composed_of",
                        "biolink:has_chemical_role",
                        "biolink:has_member", "biolink:member_of"
                    ]
                },
                "eB": {
                    "subject": "creativeQueryObject",
                    "object": "nA",
                    "predicates": [
                        "biolink:related_to_at_instance_level",
                        "biolink:disease_has_location", "biolink:location_of_disease",
                        "biolink:composed_primarily_of", "biolink:primarily_composed_of",
                        "biolink:has_chemical_role",
                        "biolink:has_member", "biolink:member_of"
                    ]
                }
            }
        }
    }
}
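
A minimal sketch of what "plugging in the categories from the NodeNorm ID-lookup" could look like for these templates. It assumes the lookup has already happened and its output is passed in as plain dicts. `fill_template` is a hypothetical helper, not BTE's `createQueries`, and the fallback to biolink:NamedThing when no category is known is an assumption.

```python
import copy


def fill_template(template, subject, obj):
    """Copy a template and fill creativeQuerySubject / creativeQueryObject.

    subject and obj are dicts with "ids" and an optional "categories" list.
    """
    filled = copy.deepcopy(template)  # never mutate the shared template
    nodes = filled["message"]["query_graph"]["nodes"]
    for key, given in (("creativeQuerySubject", subject),
                       ("creativeQueryObject", obj)):
        node = nodes[key]
        node["ids"] = list(given["ids"])
        # Start from any categories the template sets, then add looked-up ones.
        merged = list(node.get("categories") or [])
        for category in given.get("categories") or []:
            if category not in merged:
                merged.append(category)
        # Always leave a list behind so later code can iterate over it safely.
        node["categories"] = merged or ["biolink:NamedThing"]
    return filled
```

Because `categories` always ends up as a list, code that iterates over it would not hit a "not iterable" error for templates that leave the starting QNodes empty.
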

  4. For the first template (0 intermediates), BTE will have 1 result if there are edges between the two starting IDs. But BTE could return a lot of results when there's 1 intermediate (1 per unique intermediate).

Issues for implementation

A. Template-group matching -> Andrew agrees with this

I think Pathfinder queries (both QNodes have 1 ID and the QEdge has knowledge_type: inferred) should only use Pathfinder templates, and Pathfinder templates should only be used for Pathfinder queries.

  • The Pathfinder templates wouldn't work for MVPs 1-2 because they're too general
  • The MVP1-2 templates won't work for Pathfinder because they use is_set: true for intermediate nodes, so each result represents a unique "answer" (the QNode at the open end of the Predict-type query). But for Pathfinder/Explain-type where we set both starting QNodes to specific IDs, using is_set: true for the intermediate nodes will collapse all the "answers" into 1 giant result - which I assume we don't want.
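
To illustrate the is_set point, here is a toy grouping of paths into results. It is a simplified model of how results are formed, not BTE's actual result-assembly code, and `group_into_results` is a hypothetical name.

```python
def group_into_results(paths, is_set):
    """Group (start, intermediate, end) paths into result-like dicts.

    With is_set=True, all intermediates for the same start/end pair land in
    a single result; with is_set=False, each unique path is its own result.
    """
    if is_set:
        merged = {}
        for start, intermediate, end in paths:
            merged.setdefault((start, end), set()).add(intermediate)
        return [
            {"start": start, "end": end, "intermediates": sorted(found)}
            for (start, end), found in merged.items()
        ]
    unique_paths = sorted(set(paths))
    return [
        {"start": start, "end": end, "intermediates": [intermediate]}
        for start, intermediate, end in unique_paths
    ]
```

With both ends pinned to one ID, `is_set=True` always produces a single result no matter how many intermediates exist, which is the "1 giant result" case described above.
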

B. Odd bug(?) noticed before winter break

I'm not sure if this will be a problem if we adjust the template-group matching and test again with a Pathfinder-template-group + queries. But before winter break, I noticed that:

C. Problems setting up Pathfinder template-groups

With query-handler's inferred_explain branch checked out, I tried setting up a Pathfinder-template-group with the two generic templates I included above. But I encountered issues (also see my "recreating the problems" section below this list):

  • I thought I could set the templateGroup's subject / object to NamedThing so it'd work no matter what the starting-ID's category was. But when I tried this, BTE wouldn't use the template-group. If I set the subject / object to every biolink category, then BTE would use the template-group.
  • But then, BTE has an error bte:biothings-explorer-trapi:error_handler TypeError: queryGraph.nodes.creativeQuerySubject.categories is not iterable. I think it's because the generic Pathfinder-templates don't have QNode categories. I intended for BTE to plug in the categories from the NodeNorm ID-lookup...but this doesn't seem to happen here.
    • could be mitigated by having different templateGroups for different starting QNode categories and then setting those categories in the templates...but it's kinda redundant having multiple first templates (the 0-intermediate one)
      • So the Chem -> DiseaseOrPhenotypicFeature templateGroup would have the direct-edge template, the 1-intermediate template, and more specific ones. BTE would put the starting IDs into the Chem + DiseaseOrPhenotypicFeature QNodes so they start w/ those categories, then add the categories from NodeNorm
  • I'm not sure if either of these is intended behavior...for example, is there some problem we avoid by not expanding the subject / object categories?
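
For discussion, this is roughly what matching with category expansion could look like, so that a templateGroup declared with NamedThing matches any starting category. It is a sketch only: `PARENT` is a small hand-written excerpt of the Biolink class hierarchy (just enough for the imatinib / CML case), and `group_matches` is a hypothetical function, not BTE's template-group lookup.

```python
# Hand-written excerpt of Biolink "is_a" relationships (child -> parent).
PARENT = {
    "SmallMolecule": "MolecularEntity",
    "MolecularEntity": "ChemicalEntity",
    "ChemicalEntity": "NamedThing",
    "Disease": "DiseaseOrPhenotypicFeature",
    "DiseaseOrPhenotypicFeature": "BiologicalEntity",
    "BiologicalEntity": "NamedThing",
}


def with_ancestors(category):
    """Return the category followed by all of its ancestors."""
    chain = [category]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def group_matches(group, subject_category, object_category, predicate, expand=True):
    """Check a templateGroup entry against a query's categories and predicate."""
    def candidates(category):
        return with_ancestors(category) if expand else [category]

    return (
        predicate in group["predicate"]
        and any(c in group["subject"] for c in candidates(subject_category))
        and any(c in group["object"] for c in candidates(object_category))
    )
```

With `expand=False` (exact category match only), a NamedThing group does not match a SmallMolecule / Disease query, which is consistent with the behavior described above. Whether expanding is actually safe is the open question.
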

recreating the problems

First, check out query-handler's inferred_explain branch and replace the contents of query-handler's templateGroups.json file with this (pnpm build after!):

[
  {
    "name": "Pathfinder: find paths between two entities",
    "subject": ["NamedThing"],
    "predicate": ["related_to"],
    "object": ["NamedThing"],
    "templates": [
      "pathfinder-direct.json",
      "pathfinder-1intermediate.json"
    ]
  }
]

Second, query BTE with this Pathfinder-style query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:5291"],
                    "name": "imatinib"
                },
                "n1": {
                    "ids":["MONDO:0011996"],
                    "name": "cml"
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"],
                    "knowledge_type": "inferred"
                }
            }
        }
    }
}

For me, BTE returned no results and the console log said bte:biothings-explorer-trapi:inferred-mode No Templates matched your inferred-mode query. Your query terminates. +0ms

Third, I changed the templateGroups.json contents to include every biolink category (pnpm build after!).

[
  {
    "name": "Pathfinder: find paths between two entities",
    "subject": ["Attribute","ChemicalRole","BiologicalSex","PhenotypicSex","GenotypicSex","SeverityValue","OrganismAttribute","PhenotypicQuality","Zygosity","ClinicalAttribute","ClinicalMeasurement","ClinicalModifier","ClinicalCourse","Onset","SocioeconomicAttribute","GenomicBackgroundExposure","PathologicalProcessExposure","PathologicalAnatomicalExposure","DiseaseOrPhenotypicFeatureExposure","ChemicalExposure","DrugExposure","DrugToGeneInteractionExposure","ComplexChemicalExposure","BioticExposure","EnvironmentalExposure","GeographicExposure","BehavioralExposure","SocioeconomicExposure","OrganismTaxon","Event","AdministrativeEntity","Agent","InformationContentEntity","StudyResult","ConceptCountAnalysisResult","ObservedExpectedFrequencyAnalysisResult","RelativeFrequencyAnalysisResult","TextMiningResult","ChiSquaredAnalysisResult","LogOddsAnalysisResult","StudyVariable","CommonDataElement","Dataset","DatasetDistribution","DatasetVersion","DatasetSummary","ConfidenceLevel","EvidenceType","Publication","Book","BookChapter","Serial","Article","JournalArticle","Patent","WebPage","PreprintPublication","DrugLabel","RetrievalSource","PhysicalEntity","MaterialSample","Activity","Study","Procedure","Phenomenon","Device","DiagnosticAid","PlanetaryEntity","EnvironmentalProcess","EnvironmentalFeature","GeographicLocation","GeographicLocationAtTime","BiologicalEntity","RegulatoryRegion","AccessibleDnaRegion","TranscriptionFactorBindingSite","BiologicalProcessOrActivity","MolecularActivity","BiologicalProcess","Pathway","PhysiologicalProcess","Behavior","PathologicalProcess","GeneticInheritance","OrganismalEntity","Bacterium","Virus","CellularOrganism","Mammal","Human","Plant","Invertebrate","Vertebrate","Fungus","LifeStage","IndividualOrganism","Case","PopulationOfIndividualOrganisms","StudyPopulation","Cohort","AnatomicalEntity","CellularComponent","Cell","GrossAnatomicalStructure","PathologicalAnatomicalStructure","CellLine","DiseaseOrPhenotypicFeature","Disease","PhenotypicFeature","BehavioralFeature","ClinicalFinding","Gene","MacromolecularComplex","NucleosomeModification","Genome","Polypeptide","Protein","ProteinIsoform","ProteinDomain","PosttranslationalModification","ProteinFamily","NucleicAcidSequenceMotif","GeneFamily","Genotype","Haplotype","SequenceVariant","Snv","ReagentTargetedGene","ChemicalEntity","MolecularEntity","SmallMolecule","NucleicAcidEntity","Exon","Transcript","RnaProduct","RnaProductIsoform","NoncodingRnaProduct","MicroRna","SiRna","CodingSequence","ChemicalMixture","MolecularMixture","Drug","ComplexMolecularMixture","ProcessedMaterial","Food","EnvironmentalFoodContaminant","FoodAdditive","ClinicalEntity","ClinicalTrial","ClinicalIntervention","Hospitalization","Treatment","NamedThing"],
    "predicate": ["related_to"],
    "object": ["Attribute","ChemicalRole","BiologicalSex","PhenotypicSex","GenotypicSex","SeverityValue","OrganismAttribute","PhenotypicQuality","Zygosity","ClinicalAttribute","ClinicalMeasurement","ClinicalModifier","ClinicalCourse","Onset","SocioeconomicAttribute","GenomicBackgroundExposure","PathologicalProcessExposure","PathologicalAnatomicalExposure","DiseaseOrPhenotypicFeatureExposure","ChemicalExposure","DrugExposure","DrugToGeneInteractionExposure","ComplexChemicalExposure","BioticExposure","EnvironmentalExposure","GeographicExposure","BehavioralExposure","SocioeconomicExposure","OrganismTaxon","Event","AdministrativeEntity","Agent","InformationContentEntity","StudyResult","ConceptCountAnalysisResult","ObservedExpectedFrequencyAnalysisResult","RelativeFrequencyAnalysisResult","TextMiningResult","ChiSquaredAnalysisResult","LogOddsAnalysisResult","StudyVariable","CommonDataElement","Dataset","DatasetDistribution","DatasetVersion","DatasetSummary","ConfidenceLevel","EvidenceType","Publication","Book","BookChapter","Serial","Article","JournalArticle","Patent","WebPage","PreprintPublication","DrugLabel","RetrievalSource","PhysicalEntity","MaterialSample","Activity","Study","Procedure","Phenomenon","Device","DiagnosticAid","PlanetaryEntity","EnvironmentalProcess","EnvironmentalFeature","GeographicLocation","GeographicLocationAtTime","BiologicalEntity","RegulatoryRegion","AccessibleDnaRegion","TranscriptionFactorBindingSite","BiologicalProcessOrActivity","MolecularActivity","BiologicalProcess","Pathway","PhysiologicalProcess","Behavior","PathologicalProcess","GeneticInheritance","OrganismalEntity","Bacterium","Virus","CellularOrganism","Mammal","Human","Plant","Invertebrate","Vertebrate","Fungus","LifeStage","IndividualOrganism","Case","PopulationOfIndividualOrganisms","StudyPopulation","Cohort","AnatomicalEntity","CellularComponent","Cell","GrossAnatomicalStructure","PathologicalAnatomicalStructure","CellLine","DiseaseOrPhenotypicFeature","Disease","PhenotypicFeature","BehavioralFeature","ClinicalFinding","Gene","MacromolecularComplex","NucleosomeModification","Genome","Polypeptide","Protein","ProteinIsoform","ProteinDomain","PosttranslationalModification","ProteinFamily","NucleicAcidSequenceMotif","GeneFamily","Genotype","Haplotype","SequenceVariant","Snv","ReagentTargetedGene","ChemicalEntity","MolecularEntity","SmallMolecule","NucleicAcidEntity","Exon","Transcript","RnaProduct","RnaProductIsoform","NoncodingRnaProduct","MicroRna","SiRna","CodingSequence","ChemicalMixture","MolecularMixture","Drug","ComplexMolecularMixture","ProcessedMaterial","Food","EnvironmentalFoodContaminant","FoodAdditive","ClinicalEntity","ClinicalTrial","ClinicalIntervention","Hospitalization","Treatment","NamedThing"],
    "templates": [
      "pathfinder-direct.json",
      "pathfinder-1intermediate.json"
    ]
  }
]

Then I tried the same query again. I got status 500 and these console logs:

  bte:biothings-explorer-trapi:inferred-mode Query proceeding in Inferred Mode. +0ms
  bte:biothings-explorer-trapi:inferred-mode Looking up query Templates +0ms
  bte:biothings-explorer-trapi:inferred-mode Got 2 inferred query templates. +10ms
  bte:biothings-explorer-trapi:error_handler TypeError: queryGraph.nodes.creativeQuerySubject.categories is not iterable
  bte:biothings-explorer-trapi:error_handler     at /Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler/built/inferred_mode/inferred_mode.js:168:70
  bte:biothings-explorer-trapi:error_handler     at Array.map (<anonymous>)
  bte:biothings-explorer-trapi:error_handler     at InferredQueryHandler.createQueries (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler/built/inferred_mode/inferred_mode.js:166:38)
  bte:biothings-explorer-trapi:error_handler     at async InferredQueryHandler.query (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler/built/inferred_mode/inferred_mode.js:402:28)
  bte:biothings-explorer-trapi:error_handler     at async TRAPIQueryHandler._handleInferredEdges (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler/built/index.js:505:39)
  bte:biothings-explorer-trapi:error_handler     at async TRAPIQueryHandler.query (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler/built/index.js:566:13)
  bte:biothings-explorer-trapi:error_handler     at async task (/Users/colleenxu/Desktop/biothings_explorer/packages/bte-server/built/routes/v1/query_v1.js:34:13)
  bte:biothings-explorer-trapi:error_handler     at async runTask (/Users/colleenxu/Desktop/biothings_explorer/packages/bte-server/built/controllers/threading/threadHandler.js:265:26)
  bte:biothings-explorer-trapi:error_handler     at async /Users/colleenxu/Desktop/biothings_explorer/packages/bte-server/built/routes/v1/query_v1.js:18:34 +0ms

D. Other implementation issues

  • The second template (1 intermediate) can take a long time to run (> 5 min). Perhaps it'd help if we implemented some of the timing cut-off ideas from group meeting?
  • With the second template (1 intermediate), we could end up with paths too long for the current UI to handle (I think it can only handle 3-edge long paths?). For example if both starting IDs have descendants, we could end up with 4-edge paths like startingID1 -> descendant1 -> intermediate <- descendant2 <- startingID2.
  • It'll be helpful to have more Pathfinder use-cases, including ones that aren't chem-disease or chem-gene. Are there any in the Feedback / QotM / old-demo stuff?
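
To make the path-length concern above concrete, here is a minimal sketch of the arithmetic. It assumes each starting ID that gets expanded to a descendant adds exactly one extra edge to the result path, and that the UI limit is 3 edges (the guess made above). The function name and constant are hypothetical and not part of BTE.

```python
UI_MAX_EDGES = 3  # assumption: the UI limit guessed at in this thread


def result_path_edges(n_intermediates: int, n_expanded_ends: int) -> int:
    """Edges in one result path.

    A template with k intermediates has k + 1 QEdges. Each starting ID that
    is matched through a descendant adds one more edge (ID -> descendant).
    """
    template_edges = n_intermediates + 1
    return template_edges + n_expanded_ends


for n_inter in (0, 1):
    for n_expanded in (0, 1, 2):
        edges = result_path_edges(n_inter, n_expanded)
        flag = "too long" if edges > UI_MAX_EDGES else "ok"
        print(f"{n_inter} intermediate(s), {n_expanded} expanded end(s): {edges} edges ({flag})")
```

Under these assumptions, only the 1-intermediate template with both starting IDs expanded goes past the limit.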

@colleenXu
Collaborator Author

Data-source note from Andrew: perhaps a Cell marker database (gene <-> cell type and gene <-> tissue) like http://xteam.xbio.top/CellMarker/search.jsp?quickSearchInfo=c-kit would be helpful to add...

@Genomewide

@colleenXu Do you have the data from these queries so we can play with it?

@colleenXu
Collaborator Author

colleenXu commented Jan 12, 2024

@Genomewide

We don't have a full TRAPI response (running all the templates and merging the results into 1 set). The paths in the result sub-graphs may also be too long for the current UI to handle (> 3 edges, 4 nodes?).

You could try working with some of BTE's responses for the individual template runs (you can ignore the extra notes; those are for our team):

  • imatinib → asthma with 0 intermediates
  • imatinib → Gene → Cell ← asthma: imatinib-gene-cell-asthma-2.json
    • 1 min 6 s, 314 results
    • All results are pretty low scoring, with lots of ties
    • only 4 unique cell entities: mast cell, eosinophil, t-lymphocyte, t-helper cell type 2
    • results w/ gene KIT:
      • KIT → mast cell is result 42
      • c-KIT → mast cell is result 63
  • imatinib → Gene → MolecularActivity ← asthma: imatinib-gene-molecularActivity-asthma.json
    • 1 min 21 s, 738 results
    • All results are pretty low scoring, with lots of ties
    • 13 unique molecular activities, most are fairly general terms: Phosphorylation, DNA Methylation, Signal Transduction Pathways, Up-Regulation (Physiology), Lipid metabolism, Biochemical Pathway, enzyme activity, immunoreactivity, histone acetylation, complement activation, receptor function, carbohydrate metabolism, arachidonic acid metabolic process
    • KIT → phosphorylation is result 666
  • imatinib → NamedThing ← asthma: imatinib-inter-asthma-4.json
    • for notes, look at the "imatinib ➡️ 1 intermediate ⬅️ asthma" section of the above post
    • note that it doesn't start out with a category on the imatinib QNode.
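
For anyone playing with these files, the kinds of numbers in the notes above (unique intermediates, how many results share a score, where a given gene first shows up) can be pulled out with a short script. This is a rough sketch, not the script used for the notes. It assumes TRAPI 1.4-style results (`node_bindings` keyed by QNode, score under `analyses[0].score`, with a fallback to a top-level `score`) and 1-based result numbering. The QNode keys and `EX:` CURIEs in the toy response are placeholders, so check the real files for the actual keys.

```python
from collections import Counter


def bound_ids(result: dict, qnode_key: str) -> set:
    """IDs bound to one QNode in a single TRAPI result."""
    return {b["id"] for b in result.get("node_bindings", {}).get(qnode_key, [])}


def result_score(result: dict):
    """Assumes TRAPI 1.4 (analyses[0].score); falls back to result['score']."""
    analyses = result.get("analyses") or [{}]
    return analyses[0].get("score", result.get("score"))


def summarize(response: dict, qnode_key: str) -> dict:
    """Count results, unique IDs on one QNode, and results sharing a score."""
    results = response["message"]["results"]
    unique = set()
    for r in results:
        unique |= bound_ids(r, qnode_key)
    score_counts = Counter(result_score(r) for r in results)
    tied = sum(n for n in score_counts.values() if n > 1)
    return {"n_results": len(results), "unique_ids": sorted(unique), "n_tied_results": tied}


def first_rank(response: dict, qnode_key: str, curie: str):
    """1-based position of the first result binding `curie`, or None."""
    for rank, r in enumerate(response["message"]["results"], start=1):
        if curie in bound_ids(r, qnode_key):
            return rank
    return None


# Toy response with placeholder keys and CURIEs (not real data).
toy = {"message": {"results": [
    {"node_bindings": {"gene": [{"id": "EX:gene1"}], "cell": [{"id": "EX:cellA"}]},
     "analyses": [{"score": 0.2}]},
    {"node_bindings": {"gene": [{"id": "EX:gene2"}], "cell": [{"id": "EX:cellA"}]},
     "analyses": [{"score": 0.1}]},
    {"node_bindings": {"gene": [{"id": "EX:gene3"}], "cell": [{"id": "EX:cellB"}]},
     "analyses": [{"score": 0.1}]},
]}}
print(summarize(toy, "cell"))
print(first_rank(toy, "gene", "EX:gene3"))
```

To use it on one of the attached files, load the JSON with `json.load` and pass the QNode key of the intermediate you care about.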

@colleenXu
Copy link
Collaborator Author

colleenXu commented Jan 19, 2024

Queries run locally for prototype presentation: https://docs.google.com/presentation/d/1gFFGJGumtHU_ktHKM2FKauTpC-0bvAh-H_ZI49-qDsI/edit?usp=sharing

All queries set imatinib as ChemicalEntity and the disease as DiseaseOrPhenotypicFeature. I'm assuming BTE's templates would set the template placeholder nodes to these categories, which is how our templates/implementation currently work. Maybe in the future we could adjust this to have no "template categories"?
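
As an illustration of the template shape described here, below is a hedged sketch that builds a TRAPI query graph with categories fixed on the two ID-bearing nodes and a chain of intermediates between them. It is not BTE's actual template format. The function name, node/edge keys, and `EX:` CURIEs are made up for the example, and predicates are left off the QEdges. Edge directions follow the arrows used in this thread (start → intermediates ← end).

```python
def build_query_graph(
    start_id: str,
    end_id: str,
    intermediates=(),
    start_categories=("biolink:ChemicalEntity",),
    end_categories=("biolink:DiseaseOrPhenotypicFeature",),
) -> dict:
    """Build a TRAPI query for start -> inter0 -> ... -> interN <- end.

    `intermediates` is a sequence of category lists, one per intermediate QNode.
    With no intermediates, a single start -> end QEdge is produced.
    """
    nodes = {
        "start": {"ids": [start_id], "categories": list(start_categories)},
        "end": {"ids": [end_id], "categories": list(end_categories)},
    }
    edges = {}
    previous = "start"
    for i, categories in enumerate(intermediates):
        key = f"inter{i}"
        nodes[key] = {"categories": list(categories)}
        edges[f"e{len(edges)}"] = {"subject": previous, "object": key}
        previous = key
    if intermediates:
        # The end node points back at the last intermediate.
        edges[f"e{len(edges)}"] = {"subject": "end", "object": previous}
    else:
        edges["e0"] = {"subject": "start", "object": "end"}
    return {"message": {"query_graph": {"nodes": nodes, "edges": edges}}}


# Placeholder CURIEs; substitute the real imatinib / disease IDs.
query = build_query_graph(
    "EX:chemical", "EX:disease",
    intermediates=[["biolink:Gene"], ["biolink:Cell"]],
)
```

Passing empty tuples for `start_categories` and `end_categories` would give the "no template categories" variant, although a real implementation would probably want to omit the `categories` key entirely in that case.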

(Run on a local instance: main branches + fix-776 branches for workspace/api-response-transform for #776. Also w/o threading or caching.)

1 intermediate

imatinib (ChemicalEntity) → NamedThing ← asthma (DiseaseOrPheno)

imatinib-inter-asthma-latest.json

  • 6 min 47 s, 949 results
  • top results are still KIT, PDGFRA


imatinib (ChemicalEntity) → NamedThing ← CML (DiseaseOrPheno)

imatinib-inter-cml-latest.json

  • 5 min 37 s, 1546 results
  • top results are still BCR, ABL1


Gene → Cell

See previous post for imatinib → Gene → Cell ← asthma

imatinib (ChemicalEntity) → Gene → Cell ← CML (DiseaseOrPheno)

imatinib-gene-cell-cml.json

  • 1 min 45 s, 1419 results
  • 15 unique entities: interesting ones are Hematopoietic stem cells, Blast Cell, Bone Marrow Cells, granulocyte, Pluripotent Stem Cells
    • others: cultured cell line, t-lymphocyte, stem cells, Lymphocyte, Neoplastic Cell, Clone Cells, K-562, Leukemic Cell, lymphoblast, Blood Cells
  • results w/ gene BCR:
    • BCR → blast cell is result 8 (also connected to hematopoietic stem cells in result 215, bone marrow cells in result 642)
    • fusion proteins, bcr-abl → hematopoietic stem cells is result 206
  • results w/ gene ABL1:
    • ABL1 → hematopoietic stem cells in result 180 (bone marrow cells in result 606)

Gene → PhysiologicalProcess,Pathway

imatinib (ChemicalEntity) → Gene → PhysiologicalProcess,Pathway ← Asthma (DiseaseOrPheno)

imatinib-gene-physiopath-asthma.json

  • 6 min 35 s, 1254 results
  • Doing this because
    • BiologicalProcess takes too long to run (> 13 min w/ 21 unique intermediates); PhysiologicalProcess and Pathway are its most promising child categories
      • most interesting is "IgE responsiveness, atopic"
      • KIT connected to edema (HP:0000969) and cardiac rhythm disease (MONDO:0007263), anaphylaxis (MONDO:0100053), respiratory arrest (HP:0005943)
    • also, MolecularActivity wasn't interesting (see previous post)
  • no exact matches for "immune cell activation", but some stuff is close
  • no pathways found
  • 38 PhysiologicalProcess terms, most of them generic. Some interesting ones were:
    • immune response: results 166-195
    • bronchoconstriction: result 234
    • histamine release
    • t-cell activation
    • neutrophil infiltration
    • immune cell processes
    • host defense
    • antiviral response
    • cytokine production: results 236 - 245
    • Negative Regulation of Inflammatory Response Process

imatinib (ChemicalEntity) → Gene → PhysiologicalProcess,Pathway ← CML (DiseaseOrPheno)

imatinib-gene-physiopath-cml.json

  • 3 min 35 s, 3398 results
  • Doing this because BiologicalActivity would probably take too long to run
  • no exact matches for "cell cycle", but some stuff is close
  • 29 Pathways found! some interesting ones:
    • Cyclin D associated events in G1 (Homo sapiens) - reactome
    • pathways in cancer - bioplanet
    • Inhibition of cellular proliferation by Gleevec - bioplanet
    • Chronic myeloid leukemia - bioplanet
  • 12 PhysiologicalProcess terms. Some interesting ones:
    • cell proliferation: results 8-282. BCR in 102, ABL1 in 279.
    • lymphocyte activation (results 1-7)
    • mitotic metaphase
    • negative regulation of g2 phase

@Genomewide

I did not respond to one of your previous comments, but 3 edges is fine. That is the max though. I look forward to seeing this!

@colleenXu
Collaborator Author

Closing. Pathfinder efforts are now in #794.
