for entity-based record structures (BioThings APIs), "reverse" operations cannot retrieve the same information as "forward" operations #316
@andrewsu This is two parts:
An explanation of the second point: For the core biothings APIs, the data is organized by entity, so MyDisease.info is organized by Disease. When querying from Disease -> Gene, we can look up everything under that disease's disgenet.genes_related_to_disease section, which includes all of the information in the second screenshot. However, when we want to query from Gene -> Disease, we need to match the gene ID, i.e. a specific record under the disgenet.genes_related_to_disease section. Because the data is structured by disease, the query still retrieves everything under that section, not just the specific record that has that gene ID. For example, POST this query starting with the Gene NDUFA1 (4694) to https://mydisease.info/v1/query?fields=disgenet.xrefs,disgenet.genes_related_to_disease:
The response includes diseases where ONE of their objects matches the query, but it includes ALL of the genes related to those diseases rather than only the objects that have the matching gene... I hit a similar problem when trying to make more specific queries to map to more specific biolink predicates (like marker/mechanism under MyDisease's CTD Disease-Chemical information). I describe another example in the notes here. Because I get all the objects under the disease back rather than the matching objects only, I cannot make more specific queries... |
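To make the problem concrete, here is a minimal Python sketch of the client-side filtering that would be needed: given one disease hit, keep only the nested objects whose gene ID matches the query input. The sample hit is made up for illustration (placeholder disease ID, invented scores and second gene), and the `gene_id` field name is an assumption about the record layout, not something confirmed in this thread.

```python
from copy import deepcopy


def keep_matching_associations(hit, gene_id, section="disgenet",
                               list_field="genes_related_to_disease",
                               id_field="gene_id"):
    """Return a copy of `hit` whose nested list keeps only objects matching `gene_id`.

    `id_field` is an assumed field name; adjust it to the real record layout.
    """
    filtered = deepcopy(hit)
    associations = filtered.get(section, {}).get(list_field, [])
    # A document with a single association may hold an object instead of a list.
    if isinstance(associations, dict):
        associations = [associations]
    filtered.setdefault(section, {})[list_field] = [
        assoc for assoc in associations
        if str(assoc.get(id_field)) == str(gene_id)
    ]
    return filtered


# Made-up hit: one disease document carrying two gene associations.
hit = {
    "_id": "EXAMPLE_DISEASE:1",  # placeholder ID
    "disgenet": {
        "genes_related_to_disease": [
            {"gene_id": 4694, "gene_name": "NDUFA1", "score": 0.5},
            {"gene_id": 1234, "gene_name": "EXAMPLE_GENE", "score": 0.1},
        ]
    },
}

filtered_hit = keep_matching_associations(hit, 4694)
print(filtered_hit["disgenet"]["genes_related_to_disease"])
```

This only works if the reverse query actually requests the whole nested section in its `fields`, and it would have to be repeated per API and per section.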
Returning to this: this is an inherent limitation from how these records are structured (and indexed and retrieved - the querying process). After discussion with Andrew 12/6, we decided to keep this open as a non-critical thing....to discuss + maybe work on when there is time... |
If this issue can be addressed through api_response_transform or elsewhere in records handling, we might now be in a better position to address this? |
@tokebe The last time I talked about it with Andrew, it seemed kinda hard... I think this is a limitation imposed by the document-structure / biothings querying ability itself.
|
This isn't an issue for "association-based" APIs, AKA where the structure is "one document per association" and all the info on the association is kept in a separate part of the document from the entity IDs. As soon as a document has parts (like multiple associations in 1 document, each document represents 1 of the entity IDs)....this problem happens. |
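The difference between the two structures can be sketched in Python by flattening one entity-based document into "one record per association" form. The document below is invented for illustration, and the field names (`gene_id`, `score`, `source`) are assumptions rather than the actual schema.

```python
def to_association_records(doc, section, list_field, object_id_field):
    """Flatten an entity-based document into one record per association.

    Each output record keeps the entity ID, the associated ID, and the
    association's own attributes together, as in an association-based API.
    """
    nested = doc.get(section, {}).get(list_field, [])
    if isinstance(nested, dict):  # single association stored as an object
        nested = [nested]
    records = []
    for assoc in nested:
        attributes = {k: v for k, v in assoc.items() if k != object_id_field}
        records.append({
            "subject": doc["_id"],
            "object": assoc[object_id_field],
            "attributes": attributes,
        })
    return records


# Made-up entity-based document: one disease, two nested associations.
doc = {
    "_id": "EXAMPLE_DISEASE:1",
    "disgenet": {
        "genes_related_to_disease": [
            {"gene_id": 4694, "score": 0.5, "source": "EXAMPLE_SOURCE"},
            {"gene_id": 1234, "score": 0.1, "source": "EXAMPLE_SOURCE"},
        ]
    },
}

records = to_association_records(
    doc, "disgenet", "genes_related_to_disease", "gene_id"
)
for record in records:
    print(record)
```

Once the data is in this shape, matching on either ID returns exactly one association with its own attributes, which is why the association-based APIs do not show the asymmetry.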
Hmm...this seems like it should be possible with post-processing in the transformer, but I agree that this would have to be basically on a per-API basis. We'd have to write new transformers for this, so it makes sense this should remain non-critical until we have more bandwidth. |
@tokebe Can you explain how it could be done with post-processing in the transformer? I was looking into this issue a bit and it seems like when querying from Gene->Disease, the disgenet score/information is missing so there would need to be other queries done to retrieve this information again. Also, would it be possible/practical to have something that says that mydisease should always be queried starting from Disease? |
I was under the impression that the issue is that querying Gene->Disease returns the whole document, from which we currently don't have the logic to pull out the disgenet score/information? this would be in the If this isn't the case, then yes, we'd have to come up with some other method of retrieving the additional information. I'm not sure exactly how practical it might be to specifically query mydisease Disease-first always, though it might be relatively doable with a custom query builder. This would still require a custom transformer, however, and some additional logic to ensure records are created in the correct direction. The preference would definitely be to post-process |
I did some more investigation and it doesn't grab the disgenet score/information at all when querying from Gene->Disease (the params pulled from the x-bte are completely different). I think what @colleenXu is saying above is that this is a limitation of how the data is structured. So if we were to take the post-processing route, we would have to make a whole other query to retrieve the disgenet score/info document, process that, then reincorporate it into the results. Below are the query configs and the resulting unTransformedHits: |
I suppose this makes the query-direction route more viable -- we'd need a separate query builder for mydisease that checks the subject/object semantic type and queries in reverse appropriately. It would have to somehow tag this such that the record is constructed in reverse of the query where appropriate as well. Perhaps a |
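A rough Python sketch of the direction bookkeeping described in this comment, assuming a hypothetical `QueryPlan` tag and hypothetical helper names (none of these exist in BTE). It only shows how a record could be built in the query's direction after the API was called entity-first; it does not address where the entity-side IDs would come from.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    api_input_type: str   # semantic type sent to the API
    api_output_type: str  # semantic type read from the response
    reversed: bool        # True if the API call runs opposite to the query edge


def plan_query(subject_type, object_type, entity_type="Disease"):
    """Always call the API starting from its own entity type (hypothetical rule)."""
    if subject_type == entity_type:
        return QueryPlan(subject_type, object_type, reversed=False)
    if object_type == entity_type:
        return QueryPlan(object_type, subject_type, reversed=True)
    raise ValueError(f"neither end of the edge is a {entity_type}")


def build_record(plan, api_input_id, api_output_id):
    """Build the record in the direction of the original query edge."""
    if plan.reversed:
        return {"subject": api_output_id, "object": api_input_id}
    return {"subject": api_input_id, "object": api_output_id}


# Query edge is Gene -> Disease, but the API is called Disease-first.
plan = plan_query("Gene", "Disease")
record = build_record(plan, "EXAMPLE_DISEASE:1", "NCBIGene:4694")
print(plan.reversed, record)
```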
Err....I was out when this convo started but perhaps some more explanation / my perspective can help. I am in agreement with Jackson's points here, and that working with the unTransformedHits is better. But to develop that code, one will have to mutate the smartapi specs or work with a custom version of the smartapi yaml where the fields are specified differently (to retrieve all the info available in forward querying, during a reverse query). Notice the query I give in my original post. This query doesn't have the same "fields" specified as the query in the x-bte annotation right now, because we don't have the features to correctly process it (it would just be extra data to send over the internet / ignore while processing). |
Noting a related old discussion (internal lab Slack link): besides the "reverse" issue here, there's an issue of not being able to get a subset of the response. This is a problem when we want to treat those subsets differently (ex: assigning different biolink predicates or edge-attributes for the TRAPI response). Also pasted below:
|
JQ could be able to help.
|
Noting that we previously decided list_filter could not be used to address the "reverses" issue: see the internal Slack discussion starting here. It's hard to paste the whole convo here, but I may do it later... |
Update: We still have issues with not being able to retrieve all the information on the association in the "reverse" direction. I was able to get MyChem aeolus count info to show in the reverse direction by doing a non-batch query and using jmespath (only show the part of the json object that matches the starting ID). I was also able to get MyChem Chembl treats reference info to show in the reverse direction (see commit). But I wasn't able to use the same method to get the MyChem chembl drug-mechanism clinicaltrial info to show (
POST query version
And My Variant's
POST query where I get the error
|
Here's a list of the entity-based BioThings APIs (affected by the reverses issue):
|
These are the reverse operations where the forward direction has publication info that would be nice to retrieve. I organized by what seems doable now with jmespath (related to #733?)
|
I found a different jmespath issue with MyGene. If I do this query, I get genes that match the disease, but I want to only keep the `clingen.clinical_validity` objects that have the matching disease.
Query:
Example hits:
But when I add jmespath, the hits that had one clinical_validity object (with the matching disease) become null.
Query:
Those same example hits:
I suspect jmespath is having an issue with the array (multiple clinical_validity objects) vs object (1 clinical_validity object) in the original document... |
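If that suspicion is right, one workaround is to normalize the field to a list before filtering. The sketch below does this in plain Python rather than jmespath; the hits are made up, and the `mondo` field name inside `clinical_validity` is an assumption.

```python
def as_list(value):
    """Wrap a single object in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def filter_clinical_validity(hit, disease_id, id_field="mondo"):
    """Keep only the clinical_validity objects that match `disease_id`."""
    clingen = hit.get("clingen", {})
    validity = as_list(clingen.get("clinical_validity"))
    kept = [v for v in validity if v.get(id_field) == disease_id]
    return {**hit, "clingen": {**clingen, "clinical_validity": kept}}


# Made-up hits: one stores a single object, the other an array of objects.
single_hit = {
    "_id": "1",
    "clingen": {"clinical_validity": {"mondo": "MONDO:EXAMPLE_A"}},
}
multi_hit = {
    "_id": "2",
    "clingen": {"clinical_validity": [
        {"mondo": "MONDO:EXAMPLE_A"},
        {"mondo": "MONDO:EXAMPLE_B"},
    ]},
}

single_result = filter_clinical_validity(single_hit, "MONDO:EXAMPLE_A")
multi_result = filter_clinical_validity(multi_hit, "MONDO:EXAMPLE_A")
print(single_result["clingen"]["clinical_validity"])
print(multi_result["clingen"]["clinical_validity"])
```

With the normalization step, both shapes yield a list holding only the matching object, so the single-object hits are no longer lost.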
Made issues for the jmespath stuff I'm seeing:
|
Potential breakthrough: using a new parameter |
Tentatively labeling this a bug, but it may be an inherent limitation.
This query
produces this result:
But when I simply flip the subject and object, the result has more edge provenance.
Is there some inherent limitation in the smartAPI annotation on why this asymmetry has to exist?