-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Dagster job to create rollup JSON/collection #643
base: main
Are you sure you want to change the base?
Conversation
This is a first pass as to what the JSON should look like. We can make improvements as necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduces several new functions without any corresponding test coverage. I'm also unclear if we need to still be explicitly specifying the relationship slots or if refscan is in a place where we can use functions from that code base to derive relationship slots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm a little confused about the broader purpose of this Dagster job. Who would run it? When? And for what purpose?
I could imagine a Dagster job that pre-computes a collection which maps Biosample IDs to other associated IDs and then having an API endpoint which utilizes that new collection. But this Dagster job that only does it for one Biosample has me scratching my head a little.
All good questions above @pkalita-lbl! So the overall vision/direction in which this is headed (working on this in the referential integrity and rollup squad) is that we're going to have a rollup collection with the Biosample rollup for all studies materialized in it, and this will be created by way of a Dagster job.
Hm, the Dagster job in this PR at the moment does it for a study. It produces a JSON with a list of documents which has each of the Biosamples that are part of the study associated with the IDs that are "related" to that Biosample. |
Sorry that's my mistake, I meant to say one Study, not one Biosample. So then is the idea that this job will eventually be expanded to compute the associated IDs for all studies? Is this just a first step towards that? |
The use case (I think) is we have studies with some but not all of the data, so the ETL script needs to figure out which samples need 1) biosamples or 2) data generation records. For example for NEON we had to deleted a subset of the malformed data object records and their corresponding nucleotide sequencing records and now we need to bring in the fixed records. |
Correct, that's exactly right! Apologies, I should have mentioned that in the PR description for context. But yes, the idea is that there will be a collection in Mongo which will have associated IDs for all biosamples from all studies, and yup, this is a first step towards that. Ideally there will be an endpoint which accepts an NMDC study id as input, and also interacts with that "biosample rollup collection" to retrieve a rollup JSON for that study alone.
Yup, that's an immediate requirement for which this will be useful. But eventually once we have the materialized rollup collection in Mongo and the endpoint on top of it we should be able to plug in a study id, like say, the NEON study ids and get a JSON that has the biosamples and its associated ids in a JSON. |
Okay thanks for filling in those details. This makes a bit more sense now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these changes seem reasonable. More tests would be appreciated, but I suppose those can be added when this work gets changed to build actual Mongo collections.
Description
Create a Dagster job that makes a Biosample-keyed rollup JSON, which can be imported into a collection in Mongo. The idea being, similar to the /data_objects/study/{study_id}, if we provide a study_id to (in this case) a Dagster job, we should be able to retrieve parseable JSON or create a collection in Mongo that instead of just the Data Objects, has all associated NMDC objects materialized in it.
Fixes #642
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply
make up-test && make test-run
.Definition of Done (DoD) Checklist:
A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.
black nmdc_runtime/
?)docs/
and in https://github.com/microbiomedata/NMDC_documentation/?)make up-test && make test-run
)