
reimplement /data_objects/study/{study_id} using alldocs #608

Merged

Conversation


@sujaypatil96 sujaypatil96 commented Jul 23, 2024

Description

This PR reimplements the /data_objects/study/{study_id} endpoint in the Runtime API so that it accounts for the schema design considerations noted here: https://docs.google.com/document/d/1HglWENf1m9IDAuk6468OAFdCRvpC3wCQG8wwDHwll1M/edit?usp=sharing

Another point to note is that it also uses the alldocs collection for its implementation.

Fixes #401

TODO: write tests for this API endpoint.
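
For context, a rough sketch of the shape such an endpoint could take; the get_mongo_db dependency, the single-hop traversal, and the response shape are illustrative assumptions, not the exact code in this PR:

# Illustrative sketch only; `get_mongo_db` and the response shape are assumptions.
from fastapi import APIRouter, Depends, HTTPException
from pymongo.database import Database

from nmdc_runtime.api.db.mongo import get_mongo_db  # assumed import path

router = APIRouter()


@router.get("/data_objects/study/{study_id}")
def find_data_objects_for_study(study_id: str, mdb: Database = Depends(get_mongo_db)):
    if mdb["study_set"].find_one({"id": study_id}) is None:
        raise HTTPException(status_code=404, detail=f"Study {study_id} not found")

    results = []
    for biosample in mdb["alldocs"].find({"type": "Biosample", "part_of": study_id}):
        # Single-hop illustration: processes that take this biosample as input,
        # and the data objects those processes emit. The real traversal follows
        # the full workflow chain (see the $graphLookup pipeline later in this thread).
        output_ids = set()
        for process in mdb["alldocs"].find({"has_input": biosample["id"]}):
            output_ids.update(process.get("has_output", []))
        data_objects = list(
            mdb["data_object_set"].find({"id": {"$in": list(output_ids)}}, {"_id": 0})
        )
        results.append({"biosample_id": biosample["id"], "data_objects": data_objects})
    return results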

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

Definition of Done (DoD) Checklist:

A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
  • I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
  • New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

@sujaypatil96 sujaypatil96 changed the title to reimplement /data_objects/study/{study_id} using alldocs on Jul 23, 2024
@sujaypatil96
Collaborator Author

@PeopleMakeCulture I'm having a little trouble writing a test in tests/test_api/test_endpoints.py because I can't seem to get the tests I'm writing to run. Would you be able to help me write a test for this sometime? Perhaps after the roundtable updates at the squad sync call tomorrow?

@sujaypatil96 sujaypatil96 marked this pull request as ready for review August 5, 2024 19:00
sujaypatil96 and others added 24 commits August 7, 2024 17:14
relevant to #576
@dwinston
Collaborator

dwinston commented Aug 8, 2024

I was thinking you could implement this as an aggregation pipeline with a $graphLookup stage. The following, for example, yields 4,395 documents for me based on the prod db backup from 2024-07-29_20-12-07:

db.getCollection("alldocs").aggregate(

    // Pipeline
    [
        // Stage 1: find Biosample documents that are part of the given study
        {
            $match: {
                "type": "Biosample",
                "part_of": "nmdc:sty-11-hdd4bf83"
            }
        },

        // Stage 2: keep only the biosample id
        {
            $project: {
                "biosample_id": "$id"
            }
        },

        // Stage 3: recursively follow has_output -> has_input edges in alldocs
        // to collect every process downstream of each biosample
        {
            $graphLookup: {
                from: "alldocs",
                startWith: "$biosample_id",
                connectFromField: "has_output",
                connectToField: "has_input",
                as: "process"
            }
        },

        // Stage 4: one document per (biosample, process) pair
        {
            $unwind: {
                path: "$process"
            }
        },

        // Stage 5: keep only processes that have at least one output
        {
            $match: {
                "process.has_output.0": {"$exists": true}
            }
        },

        // Stage 6: one document per process output id
        {
            $unwind: {
                path: "$process.has_output"
            }
        },

        // Stage 7: carry the output id alongside the biosample id
        {
            $project: {
                "has_output": "$process.has_output",
                "biosample_id": 1
            }
        },

        // Stage 8: resolve each output id to its document in data_object_set
        {
            $lookup: {
                from: "data_object_set",
                localField: "has_output",
                foreignField: "id",
                as: "data_object"
            }
        },

        // Stage 9: keep only outputs that resolved to a data object
        {
            $match: {
                "data_object.0": {"$exists": true}
            }
        },

        // Stage 10: emit (biosample_id, data_object) pairs
        {
            $project: {
                "biosample_id": 1,
                "data_object": {"$first": "$data_object"}
            }
        }
    ],
);

The resulting documents in this case have the following form, e.g.:

{
    "_id" : ObjectId("66b4c7a8a2c847e76f4a1619"),
    "biosample_id" : "nmdc:bsm-11-1gzgce32",
    "data_object" : {
        "_id" : ObjectId("65a0688bfa6e7e39b884012f"),
        "id" : "nmdc:dobj-11-72cj2a39",
        "name" : "nmdc_wfmgas-11-dc2b0f84.1_contigs.fna",
        "description" : "Assembly contigs for nmdc:wfmgas-11-dc2b0f84.1",
        "alternative_identifiers" : [

        ],
        "file_size_bytes" : 127171635,
        "md5_checksum" : "090e8612f7bcf18530bb088f79a3cb54",
        "data_object_type" : "Assembly Contigs",
        "url" : "https://data.microbiomedata.org/data/nmdc:omprc-11-avd3n976/nmdc:wfmgas-11-dc2b0f84.1/nmdc_wfmgas-11-dc2b0f84.1_contigs.fna",
        "type" : "nmdc:DataObject"
    }
}
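
Since the Runtime itself is a Python codebase, the same pipeline can also be driven from Python via pymongo; here is a minimal sketch, where the connection string and database name are placeholders:

# Minimal pymongo sketch of the same pipeline; the connection string and
# database name are placeholders/assumptions.
from pymongo import MongoClient

mdb = MongoClient("mongodb://localhost:27017")["nmdc"]  # assumed local instance

study_id = "nmdc:sty-11-hdd4bf83"

pipeline = [
    {"$match": {"type": "Biosample", "part_of": study_id}},
    {"$project": {"biosample_id": "$id"}},
    # Walk has_output -> has_input edges to find all downstream processes.
    {"$graphLookup": {
        "from": "alldocs",
        "startWith": "$biosample_id",
        "connectFromField": "has_output",
        "connectToField": "has_input",
        "as": "process",
    }},
    {"$unwind": "$process"},
    {"$match": {"process.has_output.0": {"$exists": True}}},
    {"$unwind": "$process.has_output"},
    {"$project": {"has_output": "$process.has_output", "biosample_id": 1}},
    # Resolve each output id to its DataObject document.
    {"$lookup": {
        "from": "data_object_set",
        "localField": "has_output",
        "foreignField": "id",
        "as": "data_object",
    }},
    {"$match": {"data_object.0": {"$exists": True}}},
    {"$project": {"biosample_id": 1, "data_object": {"$first": "$data_object"}}},
]

for doc in mdb["alldocs"].aggregate(pipeline, allowDiskUse=True):
    print(doc["biosample_id"], doc["data_object"]["id"])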

@sujaypatil96
Collaborator Author

Wow, thanks for adding the tests and a prototype of the implementation with $graphLookup, @dwinston!! I'll take a look at it. Looks super promising ❤️

@sujaypatil96
Collaborator Author

@dwinston I ran:

make up-test
make test-build
docker compose --file docker-compose.test.yml run test -k 'test_find_data_objects_for_study'

The test test_find_data_objects_for_study() ran successfully and returned 61 biosamples and data objects, which is the number of records we expect. For the nmdc:sty-11-hdd4bf83 study, the current API endpoint also returns 61 biosamples and data objects, so this works as expected. I believe it's ready to be merged.
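
For reference, a minimal sketch of how such an endpoint test might be structured; the base-URL handling and the asserted response shape here are assumptions, not necessarily the exact test added in this PR:

# Hypothetical sketch; API_HOST and the response shape are assumptions.
import os

import requests

BASE_URL = os.getenv("API_HOST", "http://localhost:8000")  # assumed test deployment


def test_find_data_objects_for_study():
    rv = requests.get(f"{BASE_URL}/data_objects/study/nmdc:sty-11-hdd4bf83")
    assert rv.status_code == 200
    body = rv.json()
    # Expect one entry per biosample, pairing a biosample id with its data objects.
    assert isinstance(body, list) and len(body) > 0
    for entry in body:
        assert "biosample_id" in entry
        assert "data_objects" in entry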

@sujaypatil96
Collaborator Author

sujaypatil96 commented Aug 13, 2024

Also, I will capture the code for the aggregation pipeline in a separate issue, and we can work on benchmarking the two approaches: pure Python (implemented in this PR) and Python / Mongo (the pipeline above). We will ultimately pick the superior approach and use it to implement the general rollup endpoint.
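
A minimal sketch of how that benchmarking could be set up; the two functions below are placeholders for the competing implementations, not real code from this repo:

# Hypothetical benchmarking sketch; the two functions are stand-ins for the
# pure-Python implementation and the $graphLookup pipeline implementation.
import statistics
import time


def find_data_objects_python(study_id: str) -> list:
    """Placeholder for the pure-Python traversal implemented in this PR."""
    return []


def find_data_objects_pipeline(study_id: str) -> list:
    """Placeholder for the $graphLookup aggregation approach."""
    return []


def benchmark(fn, study_id: str, runs: int = 5):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(study_id)
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


for name, fn in [("python", find_data_objects_python), ("pipeline", find_data_objects_pipeline)]:
    best, median = benchmark(fn, "nmdc:sty-11-hdd4bf83")
    print(f"{name}: best={best:.3f}s median={median:.3f}s")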

) # 84MB. Should be < 100MB.


def ensure_local_mongodump():

I assume the name "ensure X" is being used to mean "ensure X exists". If that is the case, I'd prefer the function name be updated to say that specifically.
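
For illustration, the "ensure ... exists" reading could look roughly like this; the local path and the archive filename extension are assumptions, not the code under review:

# Hypothetical sketch of an explicitly named helper; paths and the archive
# extension are placeholders, not the PR's actual code.
from pathlib import Path

import requests

MONGODUMP_DIR = Path("/tmp/nmdc_mongodumps")  # placeholder location
ARCHIVE_NAME = "nmdc-prod-schema-collections__2024-07-29_20-12-07.archive.gz"  # extension assumed
ARCHIVE_URL = "https://portal.nersc.gov/cfs/m3408/meta/mongodumps/" + ARCHIVE_NAME


def ensure_local_mongodump_exists() -> Path:
    """Download the mongodump archive only if it is not already present locally."""
    MONGODUMP_DIR.mkdir(parents=True, exist_ok=True)
    archive_path = MONGODUMP_DIR / ARCHIVE_NAME
    if not archive_path.exists():
        response = requests.get(ARCHIVE_URL, stream=True, timeout=60)
        response.raise_for_status()
        with archive_path.open("wb") as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return archive_path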

"nmdc-prod-schema-collections__2024-07-29_20-12-07"
)
SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_URL = (
"https://portal.nersc.gov/cfs/m3408/meta/mongodumps/"

@eecavanna eecavanna Aug 14, 2024


I am wondering whether there is documentation in this folder on the NERSC filesystem that people there can use to know the file is used by this code base. Maybe the file could be moved to a directory named nmdc-runtime-assets, for example.


@eecavanna noted this issue. A move seems reasonable, and we should document this. Started microbiomedata/infra-admin#110

@aclum
Contributor

aclum commented Aug 15, 2024

Is this PR going to be ready soon? I would like this updated endpoint to be out for the end-of-August release so that I can get IMG switched over to using our API to find data instead of looking directly at the filesystem.

@sujaypatil96
Collaborator Author

@aclum yes, this PR is actually ready to be merged. I'm addressing some of @eecavanna's review comments now.

@sujaypatil96 sujaypatil96 merged commit c996255 into main Aug 21, 2024
2 checks passed
@sujaypatil96 sujaypatil96 deleted the issue-401-reimplement-data-objects-study-id-endpoint branch August 21, 2024 20:15