reimplement /data_objects/study/{study_id} using alldocs #608

Conversation
/data_objects/study/{study_id} using alldocs

@PeopleMakeCulture I'm having a little trouble writing a test in
relevant to #576
…nerateCredentials fix #606
Co-authored-by: eecavanna <[email protected]>
I was thinking you could implement this as an aggregation pipeline:

```javascript
db.getCollection("alldocs").aggregate(
  // Pipeline
  [
    // Stage 1: find biosamples that are part of the study
    {
      $match: {
        "type": "Biosample",
        "part_of": "nmdc:sty-11-hdd4bf83"
      }
    },
    // Stage 2: keep only the biosample id
    {
      $project: {
        "biosample_id": "$id"
      }
    },
    // Stage 3: walk the provenance graph, following has_output -> has_input edges
    {
      $graphLookup: {
        from: "alldocs",
        startWith: "$biosample_id",
        connectFromField: "has_output",
        connectToField: "has_input",
        as: "process"
      }
    },
    // Stage 4: one document per reachable process
    {
      $unwind: {
        path: "$process"
      }
    },
    // Stage 5: keep processes with at least one output
    {
      $match: {
        "process.has_output.0": {"$exists": true}
      }
    },
    // Stage 6: one document per process output
    {
      $unwind: {
        path: "$process.has_output"
      }
    },
    // Stage 7: project the output id alongside the biosample id
    {
      $project: {
        "has_output": "$process.has_output",
        "biosample_id": 1
      }
    },
    // Stage 8: resolve output ids to full data object documents
    {
      $lookup: {
        from: "data_object_set",
        localField: "has_output",
        foreignField: "id",
        as: "data_object"
      }
    },
    // Stage 9: keep rows with a matching data object
    {
      $match: {
        "data_object.0": {"$exists": true}
      }
    },
    // Stage 10: flatten the lookup result to a single document
    {
      $project: {
        "biosample_id": 1,
        "data_object": {"$first": "$data_object"}
      }
    }
  ]
);
```

The form of the resulting documents in this case is e.g.:

```javascript
{
    "_id" : ObjectId("66b4c7a8a2c847e76f4a1619"),
    "biosample_id" : "nmdc:bsm-11-1gzgce32",
    "data_object" : {
        "_id" : ObjectId("65a0688bfa6e7e39b884012f"),
        "id" : "nmdc:dobj-11-72cj2a39",
        "name" : "nmdc_wfmgas-11-dc2b0f84.1_contigs.fna",
        "description" : "Assembly contigs for nmdc:wfmgas-11-dc2b0f84.1",
        "alternative_identifiers" : [ ],
        "file_size_bytes" : 127171635,
        "md5_checksum" : "090e8612f7bcf18530bb088f79a3cb54",
        "data_object_type" : "Assembly Contigs",
        "url" : "https://data.microbiomedata.org/data/nmdc:omprc-11-avd3n976/nmdc:wfmgas-11-dc2b0f84.1/nmdc_wfmgas-11-dc2b0f84.1_contigs.fna",
        "type" : "nmdc:DataObject"
    }
}
```
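Since the runtime is a Python code base, the same pipeline could be expressed for use with pymongo. The following is a sketch under assumptions: the helper name `build_data_objects_pipeline` is hypothetical, not from this PR, and the pipeline stages simply mirror the shell version above with the study id parameterized.

```python
def build_data_objects_pipeline(study_id: str) -> list:
    """Build the alldocs aggregation pipeline for one study's data objects.

    Hypothetical helper (not from the PR); mirrors the mongo-shell pipeline above.
    """
    return [
        # Find biosamples that are part of the given study.
        {"$match": {"type": "Biosample", "part_of": study_id}},
        # Keep only the biosample id.
        {"$project": {"biosample_id": "$id"}},
        # Walk the provenance graph: follow has_output -> has_input edges.
        {"$graphLookup": {
            "from": "alldocs",
            "startWith": "$biosample_id",
            "connectFromField": "has_output",
            "connectToField": "has_input",
            "as": "process",
        }},
        {"$unwind": {"path": "$process"}},
        # Keep only processes that produced at least one output.
        {"$match": {"process.has_output.0": {"$exists": True}}},
        {"$unwind": {"path": "$process.has_output"}},
        {"$project": {"has_output": "$process.has_output", "biosample_id": 1}},
        # Resolve output ids to full data object documents.
        {"$lookup": {
            "from": "data_object_set",
            "localField": "has_output",
            "foreignField": "id",
            "as": "data_object",
        }},
        {"$match": {"data_object.0": {"$exists": True}}},
        {"$project": {"biosample_id": 1, "data_object": {"$first": "$data_object"}}},
    ]
```

With a pymongo database handle `mdb`, one would then run something like `mdb.alldocs.aggregate(build_data_objects_pipeline("nmdc:sty-11-hdd4bf83"))`.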
Wow, thanks for adding the tests and a prototype of the implementation with

@dwinston I ran:
The test
Also, I will capture the code for the aggregation pipeline here on a separate issue, and we can work on benchmarking the two approaches -- Python (implemented in this PR), and Python / Mongo (here). We will ultimately pick the superior approach, and also use that to implement the general rollup endpoint.
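For the benchmarking mentioned above, a minimal timing harness could look like the following. This is a sketch, not part of the PR; the two functions being compared (e.g. the Python implementation vs. the Mongo pipeline) are assumed to be callables you pass in.

```python
import statistics
import time


def benchmark(fn, *args, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of fn(*args).

    Median is used rather than mean to reduce the effect of outlier runs
    (cold caches, GC pauses, etc.).
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Usage would be along the lines of `benchmark(find_data_objects_python, study_id)` vs. `benchmark(find_data_objects_mongo, study_id)` (hypothetical function names).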
tests/test_api/test_endpoints.py (Outdated)

    ) # 84MB. Should be < 100MB.

    def ensure_local_mongodump():
I assume the name "ensure X" is being used to mean "ensure X exists". If that is the case, I'd prefer the function name be updated to say that specifically.
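To illustrate the suggested rename, a sketch of what such a helper might look like follows. This is an assumption about the function's behavior (download-if-missing), and the name `ensure_local_mongodump_exists` and its signature are hypothetical, not the PR's actual code.

```python
import urllib.request
from pathlib import Path


def ensure_local_mongodump_exists(archive_url: str, local_path: Path) -> Path:
    """Download the mongodump archive if it is not already on disk.

    Hypothetical rename of `ensure_local_mongodump`, making the
    "ensure X exists" intent explicit. Returns the local path either way.
    """
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # Only hit the network when the archive is absent.
        urllib.request.urlretrieve(archive_url, local_path)
    return local_path
```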
"nmdc-prod-schema-collections__2024-07-29_20-12-07" | ||
) | ||
SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_URL = ( | ||
"https://portal.nersc.gov/cfs/m3408/meta/mongodumps/" |
I am wondering whether there is documentation in this folder on the NERSC filesystem that people there can use to know that the file is used by this code base. Maybe the file could be moved to a directory named nmdc-runtime-assets, for example.
@eecavanna noted this issue. A move seems reasonable, and we should document this. Started microbiomedata/infra-admin#110
Is this PR going to be ready soon? I would like this updated endpoint to be out for the end of August release, because I would like to get IMG switched over to using our API to find data instead of looking directly at the filesystem.

@aclum yes, this PR is actually ready to be merged. I'm addressing some of @eecavanna's review comments now.
…dpoints/find.py` Co-authored-by: eecavanna <[email protected]>
…go.py` Co-authored-by: eecavanna <[email protected]>
Description
This PR seeks to reimplement the /data_objects/study/{study_id} endpoint that is currently part of the runtime API endpoint suite so that it accounts for the schema design considerations noted here: https://docs.google.com/document/d/1HglWENf1m9IDAuk6468OAFdCRvpC3wCQG8wwDHwll1M/edit?usp=sharing

Another point to note is that it also uses the alldocs collection for its implementation.

Fixes #401
TODO: write tests for this API endpoint.
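One detail the tests will need to cover: the pipeline yields one row per (biosample, data object) pair, so the endpoint presumably groups rows per biosample before returning them. A sketch of that post-processing step, under the assumption that this is the desired response shape (the function name is hypothetical):

```python
from collections import defaultdict


def group_data_objects_by_biosample(rows):
    """Group pipeline output rows into {biosample_id: [data_object, ...]}.

    `rows` are documents of the form produced by the aggregation pipeline,
    i.e. each has a "biosample_id" and a single resolved "data_object".
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["biosample_id"]].append(row["data_object"])
    return dict(grouped)
```

This keeps the Mongo side simple (flat rows) and does the cheap grouping in Python; an alternative would be a `$group` stage at the end of the pipeline.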
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

Definition of Done (DoD) Checklist:
A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.
- Is the code formatted (black nmdc_runtime/?)
- Is the documentation updated (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
- Do the tests pass (make up-test && make test-run)