Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dagster job to create rollup JSON/collection #643

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

sujaypatil96
Copy link
Collaborator

@sujaypatil96 sujaypatil96 commented Aug 16, 2024

Description

Create a Dagster job that makes a Biosample-keyed rollup JSON, which can be imported into a collection in Mongo. The idea being, similar to the /data_objects/study/{study_id}, if we provide a study_id to (in this case) a Dagster job, we should be able to retrieve parseable JSON or create a collection in Mongo that instead of just the Data Objects, has all associated NMDC objects materialized in it.

Fixes #642

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

Definition of Done (DoD) Checklist:

A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
  • I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
  • New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

@sujaypatil96 sujaypatil96 marked this pull request as ready for review October 30, 2024 16:36
@sujaypatil96
Copy link
Collaborator Author

This is a first pass as to what the JSON should look like. We can make improvements as necessary.

@sujaypatil96 sujaypatil96 requested review from aclum and removed request for pkalita-lbl October 31, 2024 15:07
Copy link
Contributor

@aclum aclum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces several new functions without any corresponding test coverage. I'm also unclear if we need to still be explicitly specifying the relationship slots or if refscan is in a place where we can use functions from that code base to derive relationship slots.

Copy link
Collaborator

@pkalita-lbl pkalita-lbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm a little confused about the broader purpose of this Dagster job. Who would run it? When? And for what purpose?

I could imagine a Dagster job that pre-computes a collection which maps Biosample IDs to other associated IDs and then having an API endpoint which utilizes that new collection. But this Dagster job that only does it for one Biosample has me scratching my head a little.

@sujaypatil96
Copy link
Collaborator Author

All good questions above @pkalita-lbl!

So the overall vision/direction in which this is headed (working on this in the referential integrity and rollup squad) is that we're going to have a rollup collection with the Biosample rollup for all studies materialized in it, and this will be created by way of a Dagster job.

But this Dagster job that only does it for one Biosample has me scratching my head a little.

Hm, the Dagster job in this PR at the moment does it for a study. It produces a JSON with a list of documents which has each of the Biosamples that are part of the study associated with the IDs that are "related" to that Biosample.

@pkalita-lbl
Copy link
Collaborator

Sorry that's my mistake, I meant to say one Study, not one Biosample.

So then is the idea that this job will eventually be expanded to compute the associated IDs for all studies? Is this just a first step towards that?

@aclum
Copy link
Contributor

aclum commented Nov 1, 2024

The use case (I think) is we have studies with some but not all of the data, so the ETL script needs to figure out which samples need 1) biosamples or 2) data generation records. For example for NEON we had to deleted a subset of the malformed data object records and their corresponding nucleotide sequencing records and now we need to bring in the fixed records.

@sujaypatil96
Copy link
Collaborator Author

So then is the idea that this job will eventually be expanded to compute the associated IDs for all studies? Is this just a first step towards that?

Correct, that's exactly right! Apologies, I should have mentioned that in the PR description for context. But yes, the idea is that there will be a collection in Mongo which will have associated IDs for all biosamples from all studies, and yup, this is a first step towards that.

Ideally there will be an endpoint which accepts an NMDC study id as input, and also interacts with that "biosample rollup collection" to retrieve a rollup JSON for that study alone.

The use case (I think) is we have studies with some but not all of the data, so the ETL script needs to figure out which samples need 1) biosamples or 2) data generation records. For example for NEON we had to deleted a subset of the malformed data object records and their corresponding nucleotide sequencing records and now we need to bring in the fixed records.

Yup, that's an immediate requirement for which this will be useful. But eventually once we have the materialized rollup collection in Mongo and the endpoint on top of it we should be able to plug in a study id, like say, the NEON study ids and get a JSON that has the biosamples and its associated ids in a JSON.

@pkalita-lbl
Copy link
Collaborator

Okay thanks for filling in those details. This makes a bit more sense now.

Copy link
Collaborator

@pkalita-lbl pkalita-lbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these changes seem reasonable. More tests would be appreciated, but I suppose those can be added when this work gets changed to build actual Mongo collections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Write a Dagster job to create the Biosample rollup collection
3 participants