Pathfinder Prototype #794
Comments
@colleenXu Please review and let me know if this aligns with your understanding and covers all the bases. Additionally, let me know if this explanation is sufficient for you to work on an example template. |
Please note that I'm currently asking for clarification regarding the 3rd example question. |
I have some feedback on the "Important Considerations". I think it'll be helpful for @tokebe and me to discuss...
(1) I'm unsure of the assumption that the inferred-mode-handler will produce 1 mega-result with many support-graphs after running the templates. A template (n0 -> inter_1 -> n2) can return > 1 result if the intermediate nodes aren't set to Then I'm not sure if the inferred-mode-handler logic will continue to keep those results separate vs merge them into 1 mega-result…
(2) I'm confused about how the number of support-graphs relates to the number of final-formatted results, ex: point 3's "count the number of support graphs, multiplying each by the number of intermediate nodes". I assumed that there'd be 1 final-formatted result per unique intermediate node… so if that intermediate node was in multiple template results (aka different support-graphs), those would all be put together into that final-formatted result.
(3) In this situation + our current subclassing code, the subclasses of n0/n2 entity IDs will count as intermediate nodes. Is this a problem / an issue to ask Translator about? I'm not sure if the other teams have implemented this subclassing feature and will encounter this…
(4) I'm not sure about the last point, because I thought the e2 support-graph would still differ between each final-formatted result. It sounds like the e2 support-graph is basically the union of the edge sets in the e0 and e1 subgraphs, i.e. it's still a subgraph containing n0, 1 specific intermediate, and n2. |
Regarding point 1 of the "Important Considerations", I'm going to review the problem again to see if it's still relevant... |
Responding to @colleenXu's feedback:
|
Update: Jackson (@tokebe), here are some slides based on our discussions of this pathfinder prototype so far. They're editable, so you should be able to adjust things. This should be useful for discussions, including with @rjawesome. On point (2):
Here's some stuff that came out of our 1-on-1 discussion today:
|
I'm putting the pathfinder template-groups and templates here: https://github.com/biothings/bte_trapi_query_graph_handler/tree/pathfinder-templates/data [EDIT: The notes below aren't using the potential answers Sui posted] Notes on Case A: how does imatinib affect asthma? (drug - disease)
Notes on Case B: how does resveratrol affect glyoxalase? (chemical - gene)
Possible answer notes
Notes on Case C: is there a possible genetic link between Crohn disease and Parkinson disease? (disease - disease)
Notes on Case D: What "molecular mechanisms" could explain the link between SLC6A20 and susceptibility to COVID19? (gene - disease)
Template guidelines:
|
@colleenXu Agreed, this is a good heuristic. @rjawesome Please note that we've finalized our expectation for how "final" results should be generated when iterating over template results (as inferred-handler result support graphs). This is detailed on slides 9-13 in the above linked slides. |
I have an update on point 1 of the "Important Considerations": I can't recreate the buggy behavior, so maybe things are fine? The previous buggy behavior was: I set up a Pathfinder TRAPI query with two starting IDs/nodes that we shouldn't find any results for - but instead results were returned that connected only to the first starting ID/node. (ref: lab Slack convo starting here) But I wasn't able to recreate this behavior using the current pathfinder-templates branch
But I did hit another bug, which didn't halt execution. I'll open another issue for it. |
Functionality should be finished in the
Current test query that I have been using:
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"ids": [
"PUBCHEM.COMPOUND:5291"
],
"categories": ["biolink:Drug"]
},
"un": {
"categories": [
"biolink:NamedThing"
]
},
"n2": {
"ids": [
"MONDO:0004979"
]
}
},
"edges": {
"e0": {
"subject": "n0",
"object": "un",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e1": {
"subject": "un",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e2": {
"subject": "n0",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
}
}
}
}
} |
@colleenXu @rjawesome I've done a brief code review of the branch, the execution looks pretty straightforward and good to me, so it's on to testing. There are a couple of notes, which might be better discussed in a draft PR:
|
|
I checked out the pathfinder branch and I can't successfully build. Perhaps the issue is that this branch isn't merged with the latest main? Here's the error
|
I've added support for example/cases 2 (chem - gene) and 3 (disease - disease):
That branch also has an update to one of the earlier templates (commit) AND is merged with the latest main. |
@rjawesome You'll have to pull in the latest from the |
@colleenXu The main branch should be merged now, which fixes the RedisClient error. Also, the new templates from |
@rjawesome: @colleenXu and I ran some testing on the imatinib-asthma example, and we're seeing some odd behavior:
|
This goes with Jackson's comment above. I think it's easiest to understand visually w/ screenshots. I'm comparing the pathfinder run to running just the template that it's using. Here's the full response jsons for both, which I viewed in a json-viewer and in ARAX-UI (import -> response): (Thankfully, this example query is pretty simple: 1 template ran, this template provides unique, single intermediate nodes in each result. So there's a 1-to-1 match between final pathfinder results and the template's results) Point 1 example: Everything related to this intermediate node should have been pruned, but it's all still there
This is the bottom result for the template. This intermediate node (FBLN5, NCBIGene:10516) should be removed from the KG, as well as the stuff associated with it (edges + aux-graphs that are unique to this intermediate node's pathfinder result, both the original template stuff and the pathfinder-constructed stuff). But they're still there in pathfinder-response: A KG Node Pathfinder edges and aux-graphs Point 2 example: pathfinder support-graph issues
This is showing the first template's result, with the intermediate node KIT (NCBIGene:3815). We expect e0's support-graph to include all the edges from imatinib (n0) to KIT (un), e1 to include the edge from KIT to asthma (n2), and e2 to include all the edges in this result. So then we look at the first pathfinder result... e0 has all the edges in the result (which is what we wanted for e2) |
Pruning has been added to pathfinder. Intermediate edges ( |
I've added support for the last example 4/D (gene - disease):
Should I make template adjustments directly in the |
@colleenXu Yes, I think that makes sense. It shouldn't cause any merge issues with any work done to code in the branch. |
@rjawesome I've reviewed your changes and each result edge looks nearly correct now. I see only one remaining problem -- the now correctly-aux-graph'd |
|
I think we are preserving the support-graph info for subclass-edges correctly. However, it's not showing up properly in the ARAX-UI. This is happening both for our "normal" creative-mode and our pathfinder responses. It's odd because I recall this stuff showing up properly in the past. Example from normal creative-mode
Saved response from running "treats"-creative mode for MONDO:0007035 (Acanthosis nigricans). The 4th result has a top-level creative-support-graph. When I go into that support-graph and then look at the pheno edges, all should have support-graphs based on their IDs. Instead, no info is shown - not even source info. Example from pathfinder Case A (imatinib-asthma)
The 5th result in the template run is PDGFRA. When you look at that template's run in ARAX-UI, you can see the support-graph/source info for one of the PDGFRA->asthma edges. But if you look at 5th pathfinder result in ARAX-UI (saved response), that same edge now doesn't show any info. When I dig into the pathfinder json, all the info for this subclass-edge/its linked support-graph seems to exist and be properly formatted. The subclass edge
the subclass support-graph
The support-graph's edges + subclass-disease node
Gene to subclass-disease
subclass-disease to main-disease
subclass-disease node exists as well
|
This post will be recording what tests I'm running, the response-jsons, basic response stats, and other notes. I'll raise errors/problems in separate comments.
Basic tests
Different starting query topologies (does it correctly throw error or continue execution):
imatinib -> Meckel syndrome, type 3 (MONDO:0011821): (chem - disease) NEGATIVE CONTROL from previous comment
Cases
Noting my possible answers and Sui's possible answers. Case A (asthma) is an example of truncating the 1st template's results to get a 500-result set. Case A (allergic asthma) and Case D have results/intermediate nodes that were found in multiple templates (showing that the merging code worked as intended).
2 Case A (chem - disease) examples
imatinib (PUBCHEM.COMPOUND:5291) -> asthma (MONDO:0004979) (saved response):
imatinib -> allergic asthma (MONDO:0004784) (saved response):
Case B (chemical - gene) - currently running only 1 template
Resveratrol (PUBCHEM.COMPOUND:445154) -> glyoxalase, GLO1 (NCBIGene:2739) (saved response)
Case C (disease - disease)
Crohn Disease (MONDO:0005011) -> Parkinson Disease (MONDO:0005180) (saved response)
Case D (gene - disease)
SLC6A20 (NCBIGene:54716) -> COVID19 (MONDO:0100096) (saved response)
|
A problem: pathfinder doesn't find templates for Case B (chem - gene). I'm not sure what's going on. Query I'm using
|
@colleenXu I'll be working on the pathfinder prototype this week as Rohan is unavailable. Regarding ARAX UI problems, that might be worth reporting to them -- otherwise it's a good note that we should trust our own JSON analysis first. I'll take a look into the Case B issue. |
@tokebe Whoops I didn't set the pathfinder flag on the Case B template group. Added this in a recent commit. Haven't analyzed the behavior yet though. |
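A minimal sketch of why the missing flag hides a template group, assuming a simplified template-group shape. The `TemplateGroup` fields other than `pathfinder`, the helper `getPathfinderTemplates`, and the template file names are hypothetical illustrations, not BTE's actual code:

```typescript
// Simplified, assumed shape of one entry in the templateGroups file.
interface TemplateGroup {
  name: string;
  subject: string[]; // subject categories the group applies to
  object: string[]; // object categories the group applies to
  templates: string[]; // template file names (hypothetical here)
  pathfinder?: boolean; // only groups with this flag are used for Pathfinder
}

// Hypothetical helper: collect templates from Pathfinder-flagged groups
// that match the categories of the two pinned nodes.
function getPathfinderTemplates(
  groups: TemplateGroup[],
  subjectCategory: string,
  objectCategory: string
): string[] {
  const selected: string[] = [];
  groups.forEach((group) => {
    if (group.pathfinder !== true) return; // unflagged groups are skipped
    if (group.subject.indexOf(subjectCategory) === -1) return;
    if (group.object.indexOf(objectCategory) === -1) return;
    group.templates.forEach((template) => {
      if (selected.indexOf(template) === -1) selected.push(template);
    });
  });
  return selected;
}
```

Under this sketch, a chem - gene group without `"pathfinder": true` contributes no templates, which would match the "doesn't find templates" symptom reported for Case B.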
There's still a problem running Case B. The 2nd template runs quickly (1 min 1s), but returns a lot of results (4472). Inferred-mode then seems to get stuck on "merging" all of the results into 1 mega-result/creative-edge - it may take ~ 1 hour? And then Pathfinder also seems to get stuck finding the intermediate nodes (I didn't wait for it to complete). I was thinking of Case B as testing multiple things that don't happen with the other cases:
@tokebe For tomorrow's deployments, I've made a branch pathfinder-simpleCaseB that doesn't use the 2nd chem-gene template. BTE will then successfully run the chem-gene example (CaseB) - but it won't find much. |
Case B should be fixed. There was an unnecessary while loop that was causing the issues in the inferred mode handler. For the intermediate nodes, the "paths" involved were getting too long, so I changed it so that each "path" only uses edges from one template result (i.e. each path only includes one pair of intermediate genes), while each intermediate node merges all the "paths" that include it. Previously the paths were getting too long by combining many edges from different template results. |
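A rough sketch of the merging step described above, assuming each template result has already been reduced to one "path" (its edge IDs plus the intermediate nodes it passes through). `TemplatePath` and `mergePathsByIntermediate` are hypothetical names, and the IDs in the usage note are placeholders:

```typescript
// One path per template result: edges are never combined across results.
interface TemplatePath {
  edges: string[]; // KG edge IDs used by this template result
  intermediates: string[]; // intermediate node IDs on this path
}

// Hypothetical helper: for each intermediate node, take the union of the
// edges of every path that includes it, keeping first-seen order.
function mergePathsByIntermediate(paths: TemplatePath[]): Record<string, string[]> {
  const merged: Record<string, string[]> = {};
  paths.forEach((path) => {
    path.intermediates.forEach((node) => {
      if (!Object.prototype.hasOwnProperty.call(merged, node)) merged[node] = [];
      const edges = merged[node];
      path.edges.forEach((edgeId) => {
        if (edges.indexOf(edgeId) === -1) edges.push(edgeId);
      });
    });
  });
  return merged;
}
```

For example, two paths `{edges: ["a1", "a2"], intermediates: ["geneA"]}` and `{edges: ["b1", "a2", "b3"], intermediates: ["geneA", "geneB"]}` would give `geneA` the edges `a1, a2, b1, b3` and `geneB` the edges `b1, a2, b3`. Path length stays bounded by the template, since no single path mixes edges from different template results.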
Note: In the Translator Architecture 4/23 call, the UI team said they'll handle "4-hop paths" (aka 4 edges long). I think we'll stay at/under that limit with our current Pathfinder templates. All are 2-3 QEdges long. There's 1 potential case where BTE would generate 5-edge paths: if it ran the 2nd/3rd "Chem-Disease" templates (3 QEdges) and results involved descendants of both the chemical and the disease starting-ID (+2 |
@rjawesome Does your optimization change the output at all? |
It basically just limits the length of result "paths," so it doesn't compute graphs that have more hops than what is specified in the template (excluding subclass hops). |
So, if I'm understanding correctly, you've changed the implementation to be more like that specified in the slides (building the new aux graphs by iterating over each template result), whereas before you were merging multiple template results and then performing a DFS on them? If not, can you briefly describe the steps in your current implementation, comparing them to the approach in the slides? |
Yes. |
I think there's a problem! The new code is giving different output with fewer results, missing KG edges, and different scores. I saw this with Case A allergic asthma:
Actual Pathfinder TRAPI query
Here's what I found when digging in:
|
I accidentally introduced a bug when speeding up the while loop that assigned support graph suffixes in the inferred mode handler. Should be fixed now. |
It looks good! EDIT: First, I reran all the "working" cases (not Case B). For Case A allergic asthma (new saved response), I now see the same number of results (419), KG nodes and edges, and aux-graphs as before. And for all the cases, the interesting results from before are still present. I see some differences between the runs now and the previous runs, but I think these are okay:
Some cases ran faster than before:
Other cases ran slower than before:
|
Something else is going on with Pathfinder and Case B, and I can't tell if it's okay or a sign of a truncation problem. The good news is that it now ran both templates in 2 min 16 s (much better than running forever!). As a reminder, the second template returns >4000 results (>1000 nodes and >7000 edges) that need truncating. Here's a Google Drive folder w/ my Case B Pathfinder run and an old run of the 2nd template I'm comparing it to (it's not an exact match to the Pathfinder's 2nd template run, but I think it's close enough for what I want to demonstrate). What I'm seeing: while there are only 500 results in the Pathfinder run...
I didn't notice any truncation issues for Case A asthma and Case C (see my previous notes). |
There should definitely be a large number of nodes and edges that aren't bound to a result directly -- we'd expect a lot of nodes that are exclusively bound to an edge used in a support graph for an edge bound to a result, which could leave a lot of extra nodes and edges that don't have an immediately obvious reason for existing. It could still be the case that there are nodes and edges that aren't properly truncated. I think the only way we can meaningfully check this is by writing a script that parses a response and checks that every node/edge somehow links (directly or indirectly) to a result. It would have to start with results and then work its way out to build lists of bound edges/nodes/support graph IDs, and then check those lists against the actual KG and support graph set. @rjawesome could you put together such a script? We'd probably want to adapt it to an integration test later, so it would see use beyond just checking this one time. |
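One possible shape for such a check, as a sketch rather than the script that was eventually written. It assumes a TRAPI-style message where results carry `node_bindings` and `analyses[].edge_bindings`, KG edges point to support graphs through a `biolink:support_graphs` attribute, and `auxiliary_graphs` list their edge IDs. The interfaces are trimmed to the fields used, and `findUnreachable` is a hypothetical name:

```typescript
interface Attribute { attribute_type_id: string; value: unknown }
interface KGEdge { subject: string; object: string; attributes?: Attribute[] }
interface Binding { id: string }
interface Analysis { edge_bindings: Record<string, Binding[]>; support_graphs?: string[] }
interface Result { node_bindings: Record<string, Binding[]>; analyses: Analysis[] }
interface Message {
  knowledge_graph: { nodes: Record<string, unknown>; edges: Record<string, KGEdge> };
  auxiliary_graphs?: Record<string, { edges: string[] }>;
  results: Result[];
}
interface Unreachable { nodes: string[]; edges: string[]; auxGraphs: string[] }

// Start from the results and walk outward through bound edges and their
// support graphs; anything never visited is a pruning candidate.
function findUnreachable(msg: Message): Unreachable {
  const nodes: Record<string, boolean> = {};
  const edges: Record<string, boolean> = {};
  const graphs: Record<string, boolean> = {};
  const edgeQueue: string[] = [];
  const graphQueue: string[] = [];
  const auxGraphs = msg.auxiliary_graphs ?? {};
  const visitEdge = (id: string): void => {
    if (!edges[id]) { edges[id] = true; edgeQueue.push(id); }
  };
  const visitGraph = (id: string): void => {
    if (!graphs[id]) { graphs[id] = true; graphQueue.push(id); }
  };

  msg.results.forEach((result) => {
    Object.keys(result.node_bindings).forEach((q) =>
      result.node_bindings[q].forEach((b) => { nodes[b.id] = true; }));
    result.analyses.forEach((analysis) => {
      Object.keys(analysis.edge_bindings).forEach((q) =>
        analysis.edge_bindings[q].forEach((b) => visitEdge(b.id)));
      (analysis.support_graphs ?? []).forEach((g) => visitGraph(g));
    });
  });

  while (edgeQueue.length > 0 || graphQueue.length > 0) {
    const edgeId = edgeQueue.pop();
    if (edgeId !== undefined) {
      const edge = msg.knowledge_graph.edges[edgeId];
      if (edge === undefined) continue; // binding points at a missing edge
      nodes[edge.subject] = true;
      nodes[edge.object] = true;
      (edge.attributes ?? []).forEach((attr) => {
        if (attr.attribute_type_id !== "biolink:support_graphs") return;
        if (!Array.isArray(attr.value)) return;
        attr.value.forEach((g) => { if (typeof g === "string") visitGraph(g); });
      });
      continue;
    }
    const graphId = graphQueue.pop();
    if (graphId !== undefined && auxGraphs[graphId] !== undefined) {
      auxGraphs[graphId].edges.forEach((e) => visitEdge(e));
    }
  }

  return {
    nodes: Object.keys(msg.knowledge_graph.nodes).filter((id) => !nodes[id]),
    edges: Object.keys(msg.knowledge_graph.edges).filter((id) => !edges[id]),
    auxGraphs: Object.keys(auxGraphs).filter((id) => !graphs[id]),
  };
}
```

If all three returned lists are empty, every node, edge, and support graph links back to some result, so a large KG would then reflect large support graphs rather than a pruning failure.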
I added a test here for pathfinder in particular: https://github.com/biothings/bte_trapi_query_graph_handler/blob/894bbb0e53148035ab73cd44ca4f22e3af5e6fb1/__test__/unittest/pathfinder.test.ts#L103-L146 |
Did some messing around with @rjawesome's test to make a script and was able to confirm that yes, pruning is working as expected. Case B just creates huge support graphs which results in many many edges. |
Related changes deployed to Prod as of 11/13 |
One priority for the current Translator sprint is a working Pathfinder prototype. This prototype must satisfy a specific input/output format, and should return adequate results for 4 example queries.
Problem Overview
Query format
The result format roughly matches the input format; 3 primary edges, with the two "pinned" query nodes and some intermediate node, and each edge being "artificial", with an associated support graph, as in preset inferred-mode queries.
Example result
Our 4 example queries are as follows:
The important differences are that:
Explaining further, for every intermediate node between the two pinned nodes, BTE must generate a result with that intermediate node as the unpinned node, and support graphs for edges on either side representing the rest of the path on either side of that node, as well as the "overall" edge having a support graph representing the full path.
This does mean that BTE will be generating many "redundant" results which bind essentially the same information (aside from the unpinned node) in different "view-frames".
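The "view-frame" idea above can be sketched as follows, assuming an answer path is available as an ordered chain of edges from the first pinned node to the second. `PathEdge`, `PathfinderResult`, and `resultsForPath` are hypothetical names used only for illustration:

```typescript
interface PathEdge { id: string; subject: string; object: string }

// Edge IDs that would go into the support graph of each "artificial" edge.
interface PathfinderResult {
  intermediate: string; // node bound to the unpinned query node
  e0: string[]; // path from the first pinned node to the intermediate
  e1: string[]; // path from the intermediate to the second pinned node
  e2: string[]; // the full path
}

// Hypothetical helper: one result per intermediate node on the path.
function resultsForPath(path: PathEdge[]): PathfinderResult[] {
  const allEdges = path.map((edge) => edge.id);
  const results: PathfinderResult[] = [];
  // The object of every edge except the last is an intermediate node.
  for (let i = 0; i < path.length - 1; i++) {
    results.push({
      intermediate: path[i].object,
      e0: allEdges.slice(0, i + 1),
      e1: allEdges.slice(i + 1),
      e2: allEdges.slice(),
    });
  }
  return results;
}
```

A 3-edge path therefore yields 2 results that share the same `e2` content and differ only in where the path is split, which is the redundancy noted above.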
Approach
In order to approach this problem within BTE's existing system, several steps must occur in query execution:
Recognize this specific query structure and enter a specific query execution mode/control-flow
Select templates separately from the existing templates. This may be accomplished by registering templates in the templateGroups file with the flag "pathfinder": true and ensuring that flag is checked when obtaining Pathfinder templates.
Fill out these templates and execute them in the normal inferred-mode way, resulting in a merged result set of each template.
Iterate over the existing result support graphs to generate a new results set with proper bindings and structure.
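As an illustration of the first step, here is a minimal sketch of recognizing the query structure. It assumes the three-edge pattern used in this thread's test query (two pinned nodes, one unpinned node, all edges with `"knowledge_type": "inferred"`). The function name `matchPathfinderShape` and the trimmed interfaces are hypothetical:

```typescript
// Simplified TRAPI query-graph shapes; only the fields used here are modeled.
interface QNode { ids?: string[]; categories?: string[] }
interface QEdge { subject: string; object: string; predicates?: string[]; knowledge_type?: string }
interface QueryGraph { nodes: Record<string, QNode>; edges: Record<string, QEdge> }
interface PathfinderShape { start: string; intermediate: string; end: string }

// Hypothetical helper: returns the node roles if the graph matches the
// Pathfinder pattern, otherwise null so normal execution can continue.
function matchPathfinderShape(qg: QueryGraph): PathfinderShape | null {
  const nodeIds = Object.keys(qg.nodes);
  const edges = Object.keys(qg.edges).map((key) => qg.edges[key]);
  if (nodeIds.length !== 3 || edges.length !== 3) return null;
  if (!edges.every((edge) => edge.knowledge_type === "inferred")) return null;

  const isPinned = (id: string): boolean => (qg.nodes[id].ids ?? []).length > 0;
  const pinned = nodeIds.filter((id) => isPinned(id));
  const unpinned = nodeIds.filter((id) => !isPinned(id));
  if (pinned.length !== 2 || unpinned.length !== 1) return null;

  const un = unpinned[0];
  const hasEdge = (s: string, o: string): boolean =>
    edges.some((edge) => edge.subject === s && edge.object === o);
  // Expect start -> intermediate, intermediate -> end, and start -> end.
  const fits = (start: string, end: string): boolean =>
    hasEdge(start, un) && hasEdge(un, end) && hasEdge(start, end);

  if (fits(pinned[0], pinned[1])) return { start: pinned[0], intermediate: un, end: pinned[1] };
  if (fits(pinned[1], pinned[0])) return { start: pinned[1], intermediate: un, end: pinned[0] };
  return null;
}
```

For the imatinib - asthma test query in this thread, this sketch would report `n0` as the start, `un` as the intermediate, and `n2` as the end. Whether other topologies should throw an error or fall through to normal execution is left open here.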
Important Considerations
These steps should be fairly straightforward to implement, with a few complications:
e2 edge (and associated support graph) for each result a given answer path generates.