Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor metaproteomics aggregation script #27

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

kheal
Copy link
Contributor

@kheal kheal commented Nov 7, 2024

This PR will refactor the generate_metap_agg.py script to address #26.

Overall, the generate_metap_agg.py has been refactored to

  1. Use the runtime API to reduce reliance with direct connection to MongoDB and take advantage of validation
  2. Access the raw data objects (not mongo data) to generate the functional annotations for aggregation for metaproteomics results (see Add ADR for metaP mongo data issues#920)
  3. Validate and submit data via API

Will not be ready for release until microbiomedata/nmdc-schema#2203 has been merged in (done).

@kheal kheal changed the title Start to refactor metaproteomics aggregation class Refactor metaproteomics aggregation script Nov 7, 2024
@kheal kheal requested a review from aclum November 7, 2024 22:04
@kheal
Copy link
Contributor Author

kheal commented Nov 7, 2024

@aclum - I'd appreciate any early feedback you have on the general approach, feel free to tag anyone else who should have eyes on this.

I'm not sure how exactly we'll be able to set the API bearer tokens as environmental variables, but that's how I've been doing development (I altered the readme to describe what environmental variables we'll need).

@kheal kheal linked an issue Nov 7, 2024 that may be closed by this pull request
@aclum
Copy link
Collaborator

aclum commented Nov 7, 2024

You can set both environmental variables and secrets in SPIN. For this we could have variables for the client ID and password and use those to get the bearer token using the https://api.microbiomedata.org/docs#/users/login_for_access_token_token_post endpoint. cc @eecavanna @shreddd

@aclum
Copy link
Collaborator

aclum commented Nov 7, 2024

variables like API url can also be moved to an environmental variable in SPIN.

@aclum
Copy link
Collaborator

aclum commented Nov 7, 2024

@kheal
Copy link
Contributor Author

kheal commented Nov 9, 2024

@aclum. Thanks for the input. I've rewritten to include a call to get a bearer token with the API username and password (set as environmental variables).

I've also pulled out the functions we can reuse for the classes into an abstract Aggregator class. That should make the work for translating the other aggregator to use the API more straightforward. So the other MetaGenomeFuncAgg really just has to write its own process_activity method and declare its filters to find appropriate records.

I'll leave this in draft until the next release since it depends on the migrated database and next schema release.

generate_metap_agg.py Outdated Show resolved Hide resolved
generate_metap_agg.py Outdated Show resolved Hide resolved
@kheal kheal requested a review from picowatt November 19, 2024 17:46
@kheal
Copy link
Contributor Author

kheal commented Nov 19, 2024

@picowatt I'll let you know when this is ready for review - I'm going to incorporate Alicia's comments and test this after new release first.

@kheal
Copy link
Contributor Author

kheal commented Nov 20, 2024

This is now generated json files that validate via the json:submit endpoint. Unfortunately I haven't been able to generate a bearer token with the appropriate level of permissions to actually test the submission to the dev data portal. I'll reach out to the runtime crew for help on that front.

Permissions aside, I think this is ready for review/merging now.

@kheal kheal marked this pull request as ready for review November 20, 2024 02:43
@kheal
Copy link
Contributor Author

kheal commented Nov 25, 2024

With my updated permissions (thanks @eecavanna), I checked that the json.submit endpoint is working as expected.

I loaded a single metaP's annotations to dev mongo's functional_annotation_agg set.
nmdc:wfmp-11-x0zhd078.1 is the ID of the workflow record that is now included in functional_annotation_agg set in dev mongo.

@aclum - is there a server/data portal issue to make sure the functional searches/ingests are expecting MetaProteomics records in the functional_annotation_agg?

@aclum
Copy link
Collaborator

aclum commented Nov 25, 2024

No, there is not a corresponding nmdc-server ticket yet, would you please make one @kheal ?

@eecavanna @dwinston what is the max payload json:submit can handle? @kheal what is the max length for expected aggregation results?

@aclum aclum requested a review from shreddd November 25, 2024 19:19
@kheal
Copy link
Contributor Author

kheal commented Nov 25, 2024

I wrote the script so that the json:submit only submits one workflow's aggregation results at at time to avoid payload issues - though I haven't testing the full lot yet. I can run the whole script locally to write into dev mongo overnight as a test.

@kheal
Copy link
Contributor Author

kheal commented Nov 25, 2024

No, there is not a corresponding nmdc-server ticket yet, would you please make one @kheal ?

Associated ticket filed here: microbiomedata/nmdc-server#1468

@kheal
Copy link
Contributor Author

kheal commented Nov 26, 2024

Moving this back into draft. Testing revealed the first API call to be exceptionally slow, attempting to fix now.

@kheal kheal marked this pull request as draft November 26, 2024 20:38
@kheal
Copy link
Contributor Author

kheal commented Nov 26, 2024

Moving this back into draft. Testing revealed the first API call to be exceptionally slow, attempting to fix now.

I've implemented a partial fix for this that will likely not work for future subclasses of the Aggregator class. I will submit a follow up issue to fix this once the issue on nmdc-runtime is addressed. Issue filed here: #28

@kheal what is the max length for expected aggregation results?

I tested the full run of the MetaP aggregator sweep method and did not encounter any payload issues with json:submit.

@kheal kheal marked this pull request as ready for review November 26, 2024 22:44
Copy link
Collaborator

@shreddd shreddd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks reasonable (my comments are minor and can be ignored if they don't apply).

Also - Is there a way to test this code to make sure it works?


def find_anno(self, dos):
rv = requests.post(self.base_url + "/token", data=token_request_body)
token_response = rv.json()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may also want to check rv.status_code or log it when raising an exception

)
json_record_full = {"functional_annotation_agg": json_records}

response = self.submit_json_records(json_record_full)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will error out with an Exception rather than logging an error and moving to the next record if you don't succeed in the submit call. That may be the correct behavior (since it is likely a bad server connection or similar), but just wanted to make sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor metaP aggregation
4 participants