CTD processing 3: handling output IDs when multiple ID prefixes are possible #585
Comments
We could have a "hasPrefix" option in the yaml under the output. If it is set to true, BTE would expect the ID prefix at the beginning of each output ID (i.e. it would parse "MESH:0000" and see that it is a "MESH" ID) instead of using the output ID type from the SmartAPI yaml. In that case, instead of labeling the output ID by its ID type in the response mapping (like "MESH"), it could be given a generic name (like "OUTPUT"). This should at least fix the output problem. I could work on this feature. |
Work has started on the multiple-prefixes branch of smartapi-kg and api-response-transform |
@rjawesome sorry for the late reply. I'm having trouble understanding your proposal...could you provide an example of x-bte annotation edits you're proposing? And as I rethink this issue, I wonder if some discussion would help:
|
My proposal would only touch the outputs section of the yaml (only for outputs, not inputs), like so:
outputs:
  - semantic: Disease
    hasPrefix: true
Then in the response mapping you would put OUTPUT instead of a prefix (like MESH), i.e.:
chemical2disease_1:
  OUTPUT: data.DiseaseID ## HAS prefix, the ID type will be determined by the prefix
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs |
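A minimal sketch of how the hasPrefix idea could resolve a single output ID, written in Python purely for illustration. The function name `resolve_output_id` and its parameters are hypothetical; they are not part of smartapi-kg or api-response-transform.

```python
from typing import Optional, Tuple


def resolve_output_id(raw_id: str, has_prefix: bool,
                      declared_prefix: Optional[str] = None) -> Tuple[str, str]:
    """Return (prefix, local_id) for one output ID.

    Hypothetical helper: with has_prefix=True the prefix is read from the ID
    itself; otherwise the prefix declared on the operation is used.
    """
    if has_prefix:
        prefix, sep, local_id = raw_id.partition(":")
        if not sep or not prefix or not local_id:
            raise ValueError(f"expected 'PREFIX:id', got {raw_id!r}")
        return prefix, local_id
    if declared_prefix is None:
        raise ValueError("declared_prefix is required when has_prefix is False")
    return declared_prefix, raw_id


# The two disease IDs quoted in this issue keep their own namespaces.
mesh_result = resolve_output_id("MESH:D015746", has_prefix=True)
omim_result = resolve_output_id("OMIM:610141", has_prefix=True)
```

With `has_prefix=False` the helper falls back to the declared ID type, which matches how an operation with a single fixed output prefix is described in this thread.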
Another feature could be added here, since outputs is already an array: if each output ID type corresponds to a different field in the json, you could just add one output per ID type. For example, the operation outputs could look like:
outputs:
  - semantic: Disease
    id: MESH
  - semantic: Disease
    id: OMIM
Then the response mapping would look like:
chemical2disease_1:
  OMIM: data.DiseaseIDomim ## omim disease id is located here in the json from api
  MESH: data.DiseaseIDmesh ## mesh disease id is located here in the json from api
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs
In this case, BTE could split this into two operations, or when the response is received from the API it could just see which ID type exists in the output. |
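The "see which ID type exists in the output" option could look roughly like the sketch below. It assumes the hypothetical field names from the example above (`DiseaseIDomim`, `DiseaseIDmesh`); `OUTPUT_FIELDS` and `extract_output_ids` are illustrative names, not existing BTE code.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping taken from the example: ID prefix -> field in each record.
OUTPUT_FIELDS: Dict[str, str] = {
    "OMIM": "DiseaseIDomim",
    "MESH": "DiseaseIDmesh",
}


def extract_output_ids(record: dict) -> List[Tuple[str, str]]:
    """Return (prefix, id) pairs for every mapped field present in the record."""
    found = []
    for prefix, field in OUTPUT_FIELDS.items():
        value = record.get(field)
        if value:  # skip fields that are missing or empty
            found.append((prefix, value))
    return found


omim_only = extract_output_ids({"DiseaseIDomim": "610141"})
mesh_only = extract_output_ids({"DiseaseIDmesh": "D015746"})
```

This keeps one operation and one API call, at the cost of the response mapping no longer pointing at a single output field.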
The first feature (hasPrefix) is currently working in the multiple-prefixes branch |
Another way to solve this problem could be using JQ post-processing + the hasPrefix/OUTPUT feature. JQ post-processing could move all the IDs into the same field in the json before response mapping. So the operation could have:
transformers:
  wrap_jq: '{data: [.[] | if .DiseaseIDomim then .DiseaseID = "OMIM:" + .DiseaseIDomim else . end | if .DiseaseIDmesh then .DiseaseID = "MESH:" + .DiseaseIDmesh else . end]}'
Then the usage of hasPrefix and OUTPUT in the response mapping would be exactly the same as the first proposal. |
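For readers less familiar with jq, here is a rough Python equivalent of what that filter does to the raw response, assuming the response is a list of records. The function name `wrap_and_merge` is hypothetical, and the field names are the illustrative ones from the proposal.

```python
from typing import Any, Dict, List


def _jq_truthy(value: Any) -> bool:
    # jq treats only null and false as falsy; empty strings and 0 are truthy.
    return value is not None and value is not False


def wrap_and_merge(records: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Copy each record, add a prefixed DiseaseID field, and wrap the list in {data: [...]}."""
    merged = []
    for record in records:
        item = dict(record)  # do not mutate the caller's data
        if _jq_truthy(item.get("DiseaseIDomim")):
            item["DiseaseID"] = "OMIM:" + item["DiseaseIDomim"]
        # As in the jq filter, the MESH branch runs second and wins if both fields are set.
        if _jq_truthy(item.get("DiseaseIDmesh")):
            item["DiseaseID"] = "MESH:" + item["DiseaseIDmesh"]
        merged.append(item)
    return {"data": merged}


wrapped = wrap_and_merge([
    {"DiseaseIDomim": "610141"},
    {"DiseaseIDmesh": "D015746"},
])
```

After this step every record carries a single prefixed `DiseaseID` field, which is what the hasPrefix/OUTPUT response mapping expects.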
@rjawesome could you pause your work on this particular issue, and keep the work specific to this issue on a separate branch from After talking with @tokebe, we agreed that there are some larger-scale issues that still have to be worked out, like:
so I plan to write an issue and start discussions on that. I think after those discussions, it'll be clearer what requirements / behavior we actually want for this issue... [EDIT: oh, one thing for sure is that in this use case and similar situations (one field, multiple ID prefixes), processing of the raw API response WILL BE REQUIRED to organize the IDs by namespace] |
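As one possible reading of "organize the IDs by namespace", the sketch below buckets the values of a single mixed-prefix field by their prefix. It is only an illustration under that assumption; `group_ids_by_namespace` is a hypothetical name and no particular BTE data structure is implied.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def group_ids_by_namespace(ids: Iterable[str]) -> Dict[Optional[str], List[str]]:
    """Bucket prefixed IDs such as 'MESH:D015746' by their namespace."""
    grouped: Dict[Optional[str], List[str]] = defaultdict(list)
    for raw_id in ids:
        prefix, sep, _ = raw_id.partition(":")
        # IDs without a 'PREFIX:' part are kept under None rather than guessed at.
        grouped[prefix if sep else None].append(raw_id)
    return dict(grouped)


# The two IDs quoted in this issue, plus an unprefixed value for the fallback case.
grouped = group_ids_by_namespace(["MESH:D015746", "OMIM:610141", "D015746"])
```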
Intro: see intro section of #583 (comment). Originally noted in #558 (comment)
3. handling output IDs when multiple ID prefixes are possible
Some operations are commented-out because BTE isn't properly handling the output IDs when multiple ID prefixes are possible. This happens when the output is a disease ID (which can be MESH or OMIM) or a pathway ID (which can be REACT or KEGG).
For example, BTE will fail to recognize that the API response returned both MESH and OMIM Disease IDs and will instead assign all the Disease IDs to the one ID-prefix assigned by the operation.
Edit SmartAPI yaml + run BTE locally
In a local copy of the SmartAPI yaml, uncomment the chemical2disease_1 and chemical2disease_2 operations (lines 125, 127, 211-231, 253-273, 548-551, 556-559).
Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):
CTD's raw response
During execution, BTE should generate this query to CTD.
In CTD's raw response, some Disease IDs are MESH, like MESH:D015746 / Abdominal Pain, and others are OMIM, like OMIM:610141 / QT INTERVAL, VARIATION IN.
BTE's current flawed response
BTE will do the operation with MESH-Disease outputs and find the OMIM ID in CTD's response. It'll then strip the OMIM ID prefix off and assign it as a MESH ID. This will result in a flawed record -> Edge to MESH:610141 (when the original ID was OMIM:610141 / QT INTERVAL, VARIATION IN).
Then it'll do similar behavior with the operation with OMIM-Disease outputs and the MESH IDs in CTD's response. So there'll be flawed records -> Edges like this to OMIM:D015746 (when the original ID was MESH:D015746 / Abdominal Pain).
I think Biolink API / Monarch's post-query processing + SmartAPI yaml response-mapping (which is to fields that exist only after the post-processing) is able to handle this situation, so maybe a solution like that will work here. However, it's perhaps not ideal that multiple operations are written + the same query is done repeatedly for different post-query processing.
This problem is related to past discussions on supporting multiple ID prefixes/namespaces as input / output. I'm not sure how much refactoring of code / x-bte annotation would be needed for a general solution...
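The mislabeling described above can be reproduced with a few lines of Python. This is only an illustration of the reported behavior (strip whatever prefix is present, then apply the operation's prefix), not BTE's actual code; `naive_relabel` is a hypothetical name.

```python
def naive_relabel(raw_id: str, operation_prefix: str) -> str:
    """Mimic the flawed behavior: drop the ID's own prefix, apply the operation's."""
    local_id = raw_id.split(":", 1)[-1]
    return f"{operation_prefix}:{local_id}"


# The two disease IDs quoted from CTD's raw response in this issue.
ctd_ids = ["MESH:D015746", "OMIM:610141"]

# Each operation relabels every ID with its own single output prefix.
mesh_operation = [naive_relabel(i, "MESH") for i in ctd_ids]
omim_operation = [naive_relabel(i, "OMIM") for i in ctd_ids]
```

Here `mesh_operation` contains the bogus `MESH:610141` and `omim_operation` contains the bogus `OMIM:D015746`, matching the flawed edges listed above.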