
reimplement /data_objects/study/{study_id} using alldocs #608

Merged

Conversation


@sujaypatil96 sujaypatil96 commented Jul 23, 2024

Description

This PR reimplements the /data_objects/study/{study_id} endpoint in the Runtime API so that it accounts for the schema design considerations noted here: https://docs.google.com/document/d/1HglWENf1m9IDAuk6468OAFdCRvpC3wCQG8wwDHwll1M/edit?usp=sharing

Another point to note is that it also uses the alldocs collection for its implementation.

Fixes #401

TODO: write tests for this API endpoint.
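
For context, a rough sketch of the shape such an endpoint could take; the get_mongo_db dependency, the single-hop traversal, and the response shape are illustrative assumptions, not the exact code in this PR:

# Illustrative sketch only; `get_mongo_db` and the response shape are assumptions.
from fastapi import APIRouter, Depends, HTTPException
from pymongo.database import Database

from nmdc_runtime.api.db.mongo import get_mongo_db  # assumed import path

router = APIRouter()


@router.get("/data_objects/study/{study_id}")
def find_data_objects_for_study(study_id: str, mdb: Database = Depends(get_mongo_db)):
    if mdb["study_set"].find_one({"id": study_id}) is None:
        raise HTTPException(status_code=404, detail=f"Study {study_id} not found")

    results = []
    for biosample in mdb["alldocs"].find({"type": "Biosample", "part_of": study_id}):
        # Single-hop illustration: processes that take this biosample as input,
        # and the data objects those processes emit. The real traversal follows
        # the full workflow chain (see the $graphLookup pipeline later in this thread).
        output_ids = set()
        for process in mdb["alldocs"].find({"has_input": biosample["id"]}):
            output_ids.update(process.get("has_output", []))
        data_objects = list(
            mdb["data_object_set"].find({"id": {"$in": list(output_ids)}}, {"_id": 0})
        )
        results.append({"biosample_id": biosample["id"], "data_objects": data_objects})
    return results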

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

Definition of Done (DoD) Checklist:

A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
  • I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
  • New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

@sujaypatil96 sujaypatil96 changed the title to reimplement /data_objects/study/{study_id} using alldocs on Jul 23, 2024
@sujaypatil96
Collaborator Author

@PeopleMakeCulture I'm having a little trouble writing a test in tests/test_api/test_endpoints.py because I can't seem to get the tests I'm writing to run. Would you be able to help me write a test for this sometime? Perhaps after the roundtable updates at the squad sync call tomorrow?

@sujaypatil96 sujaypatil96 marked this pull request as ready for review August 5, 2024 19:00
sujaypatil96 and others added 24 commits August 7, 2024 17:14
relevant to #576
@dwinston
Collaborator

dwinston commented Aug 8, 2024

I was thinking you could implement this as an aggregation pipeline with a $graphLookup stage. The following, for example, yields 4,395 documents for me based on the prod db backup from 2024-07-29_20-12-07:

db.getCollection("alldocs").aggregate(

    // Pipeline
    [
        // Stage 1: find Biosample documents that are part of the given study
        {
            $match: {
                "type": "Biosample",
                "part_of": "nmdc:sty-11-hdd4bf83"
            }
        },

        // Stage 2: keep only the biosample id
        {
            $project: {
                "biosample_id": "$id"
            }
        },

        // Stage 3: recursively follow has_output -> has_input edges in alldocs
        // to collect every process downstream of each biosample
        {
            $graphLookup: {
                from: "alldocs",
                startWith: "$biosample_id",
                connectFromField: "has_output",
                connectToField: "has_input",
                as: "process"
            }
        },

        // Stage 4: one document per (biosample, process) pair
        {
            $unwind: {
                path: "$process"
            }
        },

        // Stage 5: keep only processes that have at least one output
        {
            $match: {
                "process.has_output.0": {"$exists": true}
            }
        },

        // Stage 6: one document per process output id
        {
            $unwind: {
                path: "$process.has_output"
            }
        },

        // Stage 7: carry the output id alongside the biosample id
        {
            $project: {
                "has_output": "$process.has_output",
                "biosample_id": 1
            }
        },

        // Stage 8: resolve each output id to its document in data_object_set
        {
            $lookup: {
                from: "data_object_set",
                localField: "has_output",
                foreignField: "id",
                as: "data_object"
            }
        },

        // Stage 9: keep only outputs that resolved to a data object
        {
            $match: {
                "data_object.0": {"$exists": true}
            }
        },

        // Stage 10: emit (biosample_id, data_object) pairs
        {
            $project: {
                "biosample_id": 1,
                "data_object": {"$first": "$data_object"}
            }
        }
    ],
);

The resulting documents in this case have the following form, e.g.:

{
    "_id" : ObjectId("66b4c7a8a2c847e76f4a1619"),
    "biosample_id" : "nmdc:bsm-11-1gzgce32",
    "data_object" : {
        "_id" : ObjectId("65a0688bfa6e7e39b884012f"),
        "id" : "nmdc:dobj-11-72cj2a39",
        "name" : "nmdc_wfmgas-11-dc2b0f84.1_contigs.fna",
        "description" : "Assembly contigs for nmdc:wfmgas-11-dc2b0f84.1",
        "alternative_identifiers" : [

        ],
        "file_size_bytes" : 127171635,
        "md5_checksum" : "090e8612f7bcf18530bb088f79a3cb54",
        "data_object_type" : "Assembly Contigs",
        "url" : "https://data.microbiomedata.org/data/nmdc:omprc-11-avd3n976/nmdc:wfmgas-11-dc2b0f84.1/nmdc_wfmgas-11-dc2b0f84.1_contigs.fna",
        "type" : "nmdc:DataObject"
    }
}
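
Since the Runtime itself is a Python codebase, the same pipeline can also be driven from Python via pymongo; here is a minimal sketch, where the connection string and database name are placeholders:

# Minimal pymongo sketch of the same pipeline; the connection string and
# database name are placeholders/assumptions.
from pymongo import MongoClient

mdb = MongoClient("mongodb://localhost:27017")["nmdc"]  # assumed local instance

study_id = "nmdc:sty-11-hdd4bf83"

pipeline = [
    {"$match": {"type": "Biosample", "part_of": study_id}},
    {"$project": {"biosample_id": "$id"}},
    # Walk has_output -> has_input edges to find all downstream processes.
    {"$graphLookup": {
        "from": "alldocs",
        "startWith": "$biosample_id",
        "connectFromField": "has_output",
        "connectToField": "has_input",
        "as": "process",
    }},
    {"$unwind": "$process"},
    {"$match": {"process.has_output.0": {"$exists": True}}},
    {"$unwind": "$process.has_output"},
    {"$project": {"has_output": "$process.has_output", "biosample_id": 1}},
    # Resolve each output id to its DataObject document.
    {"$lookup": {
        "from": "data_object_set",
        "localField": "has_output",
        "foreignField": "id",
        "as": "data_object",
    }},
    {"$match": {"data_object.0": {"$exists": True}}},
    {"$project": {"biosample_id": 1, "data_object": {"$first": "$data_object"}}},
]

for doc in mdb["alldocs"].aggregate(pipeline, allowDiskUse=True):
    print(doc["biosample_id"], doc["data_object"]["id"])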

@sujaypatil96
Collaborator Author

Wow, thanks for adding the tests and a prototype of the implementation with $graphLookup, @dwinston!! I'll take a look at it. Looks super promising ❤️

@sujaypatil96
Collaborator Author

@dwinston I ran:

make up-test
make test-build
docker compose --file docker-compose.test.yml run test -k 'test_find_data_objects_for_study'

The test test_find_data_objects_for_study() ran successfully and returned 61 biosamples and data objects, which is the number of records we expect. For the nmdc:sty-11-hdd4bf83 study, the current API endpoint also returns 61 biosamples and data objects, so this works as expected. I believe it's ready to be merged.
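
For reference, a minimal sketch of how such an endpoint test might be structured; the base-URL handling and the asserted response shape here are assumptions, not necessarily the exact test added in this PR:

# Hypothetical sketch; API_HOST and the response shape are assumptions.
import os

import requests

BASE_URL = os.getenv("API_HOST", "http://localhost:8000")  # assumed test deployment


def test_find_data_objects_for_study():
    rv = requests.get(f"{BASE_URL}/data_objects/study/nmdc:sty-11-hdd4bf83")
    assert rv.status_code == 200
    body = rv.json()
    # Expect one entry per biosample, pairing a biosample id with its data objects.
    assert isinstance(body, list) and len(body) > 0
    for entry in body:
        assert "biosample_id" in entry
        assert "data_objects" in entry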

@sujaypatil96
Collaborator Author

sujaypatil96 commented Aug 13, 2024

Also, I will capture the code for the aggregation pipeline in a separate issue, and we can work on benchmarking the two approaches: pure Python (implemented in this PR) and Python / Mongo (the pipeline above). We will ultimately pick the superior approach and use it to implement the general rollup endpoint.
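
A minimal sketch of how that benchmarking could be set up; the two functions below are placeholders for the competing implementations, not real code from this repo:

# Hypothetical benchmarking sketch; the two functions are stand-ins for the
# pure-Python implementation and the $graphLookup pipeline implementation.
import statistics
import time


def find_data_objects_python(study_id: str) -> list:
    """Placeholder for the pure-Python traversal implemented in this PR."""
    return []


def find_data_objects_pipeline(study_id: str) -> list:
    """Placeholder for the $graphLookup aggregation approach."""
    return []


def benchmark(fn, study_id: str, runs: int = 5):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(study_id)
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


for name, fn in [("python", find_data_objects_python), ("pipeline", find_data_objects_pipeline)]:
    best, median = benchmark(fn, "nmdc:sty-11-hdd4bf83")
    print(f"{name}: best={best:.3f}s median={median:.3f}s")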

) # 84MB. Should be < 100MB.


def ensure_local_mongodump():

I assume the name "ensure X" is being used to mean "ensure X exists". If that is the case, I'd prefer the function name be updated to say that specifically.
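
For illustration, the "ensure ... exists" reading could look roughly like this; the local path and the archive filename extension are assumptions, not the code under review:

# Hypothetical sketch of an explicitly named helper; paths and the archive
# extension are placeholders, not the PR's actual code.
from pathlib import Path

import requests

MONGODUMP_DIR = Path("/tmp/nmdc_mongodumps")  # placeholder location
ARCHIVE_NAME = "nmdc-prod-schema-collections__2024-07-29_20-12-07.archive.gz"  # extension assumed
ARCHIVE_URL = "https://portal.nersc.gov/cfs/m3408/meta/mongodumps/" + ARCHIVE_NAME


def ensure_local_mongodump_exists() -> Path:
    """Download the mongodump archive only if it is not already present locally."""
    MONGODUMP_DIR.mkdir(parents=True, exist_ok=True)
    archive_path = MONGODUMP_DIR / ARCHIVE_NAME
    if not archive_path.exists():
        response = requests.get(ARCHIVE_URL, stream=True, timeout=60)
        response.raise_for_status()
        with archive_path.open("wb") as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return archive_path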

"nmdc-prod-schema-collections__2024-07-29_20-12-07"
)
SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_URL = (
"https://portal.nersc.gov/cfs/m3408/meta/mongodumps/"

@eecavanna eecavanna Aug 14, 2024


I am wondering whether there is documentation in this folder on the NERSC filesystem that people there can use to know the file is used by this code base. Maybe the file could be moved to a directory named nmdc-runtime-assets, for example.


@eecavanna noted this issue. A move seems reasonable, and we should document this. Started microbiomedata/infra-admin#110

@aclum
Contributor

aclum commented Aug 15, 2024

Is this PR going to be ready soon? I would like this updated endpoint to be out for the end-of-August release so that I can get IMG switched over to using our API to find data instead of looking directly at the filesystem.

@sujaypatil96
Collaborator Author

@aclum yes, this PR is actually ready to be merged. I'm addressing some of @eecavanna's review comments now.

@sujaypatil96 sujaypatil96 merged commit c996255 into main Aug 21, 2024
2 checks passed
@sujaypatil96 sujaypatil96 deleted the issue-401-reimplement-data-objects-study-id-endpoint branch August 21, 2024 20:15