
[DRAFT] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen #521

Closed
Wants to merge 14 commits.

Conversation

PeopleMakeCulture (Collaborator)

Description

This PR adds two notebooks for prototyping validation solutions to the nmdc-runtime repo.

Related to #318

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Notebooks have been run locally

@eecavanna (Collaborator) commented May 11, 2024

Rendered notebooks

Here are links to the notebooks being introduced via this PR, in their rendered form. I'm adding these links here because GitHub's "Files changed" tab shows the notebooks in source code form instead of in rendered form.

@eecavanna I've updated these links to point to rendered notebooks on this branch #318

@PeopleMakeCulture PeopleMakeCulture changed the title Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen [DRAFT:DO NOT MERGE] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen May 11, 2024
@PeopleMakeCulture (Collaborator, Author) commented May 11, 2024

TODOs before merging

  • remove tmp code
  • add set-up instructions
  • make edits from @aclum
  • pre-cleaning step needs these fields: 1) metagenome_annotation_id, 2) metaproteomic_analysis_id

Non-actionable notes:

  • `type` is not present on all classes, and its value is currently a free-form string that can contain typos. Both of these issues are fixed in the Berkeley schema, but that schema is risky to use at this time.
  • Some IDs in the current Mongo database are not unique across collections; this will be fixed by re-IDing (see the sketch below for one way to detect such duplicates).
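Not part of the notebooks themselves, but for illustration: a minimal pymongo sketch of how one could detect `id` values that are shared across collections. The connection string and database name are assumptions.

```python
from collections import defaultdict

from pymongo import MongoClient

# Assumed connection details; adjust to your local Mongo instance.
db = MongoClient("mongodb://localhost:27017")["nmdc"]

# Map each `id` value to the set of collections it appears in.
id_to_collections = defaultdict(set)
for name in db.list_collection_names():
    for doc in db[name].find({"id": {"$exists": True}}, {"id": 1}):
        id_to_collections[doc["id"]].add(name)

duplicates = {i: sorted(colls) for i, colls in id_to_collections.items() if len(colls) > 1}
print(f"{len(duplicates)} id values appear in more than one collection")
```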

@PeopleMakeCulture (Collaborator, Author)

@eecavanna @turbomam The "Mongo-to-RDF transformer" notebook should now run on other machines. Would either of you care to give it a test run?

Once the RDF has been generated, you can also skip the Docker steps and load it into another graph database server.
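For anyone who wants to inspect or load the generated RDF without the Docker-based server, a minimal sketch using rdflib; the output filename here is a placeholder, not the notebook's actual output path.

```python
from rdflib import Graph

# Placeholder path for the RDF produced by the notebook.
graph = Graph()
graph.parse("nmdc_mongo_export.ttl", format="turtle")
print(f"Loaded {len(graph)} triples")

# From here, the graph could be serialized again or pushed to whatever
# triple store you prefer, instead of the Docker-based one used in the notebook.
```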

@dwinston dwinston marked this pull request as draft May 16, 2024 19:37
@PeopleMakeCulture PeopleMakeCulture changed the title [DRAFT:DO NOT MERGE] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen May 16, 2024
@PeopleMakeCulture PeopleMakeCulture changed the title Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen [DRAFT] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen May 16, 2024
@PeopleMakeCulture (Collaborator, Author) commented May 16, 2024

Hi Folks,

Requesting your review to check that these notebooks:

  1. contain the correct logic
  2. contain adequate documentation
  3. can be run in your dev environment

Please let me know if you run into any issues with the notebooks.

@eecavanna (Collaborator)

I added some comments and TODOs in a commit just now. I haven't reviewed the "Check referential integrity" section or anything below that yet. I'll continue reviewing this during the week. In the meantime, you can continue making changes to the notebook (e.g. adding explanatory comments). I also haven't tried running it locally yet.

@sujaypatil96 (Collaborator) left a comment

This code looks great @PeopleMakeCulture!! I'm running the metadata-translation/notebooks/repl_validation_referential_integrity-1715162638.ipynb notebook as part of my code review, and I seem to be running into an error in the block of code that creates/materializes the alldocs collection.

The error I'm running into is this (truncated stacktrace):

ValueError:  Unknown argument: quality_control_report = {'status': 'pass'}

I have nmdc_schema==10.2.0 in my virtual environment.

Can you help debug this error?

@sujaypatil96 (Collaborator)

The alldocs collection that you are creating as part of the notebooks in this PR will be very useful for my work in the NCBI Export squad as well, so I'm looking forward to using it.

@sujaypatil96 (Collaborator) commented May 28, 2024

Looks like I had restored documents from an older version/dump of the Mongo database, which is why the block of code I reported above was erroring out. I can confirm that I was able to successfully restore all documents from the latest dump of Mongo that you sent me.

Thank you for the help @PeopleMakeCulture 😁

Contributor:

Does this SPARQL notebook need to be reviewed? I thought we were using a strictly Python approach?

@PeopleMakeCulture (Collaborator, Author):

It can be ignored for now.

"source": [
"Determine the name of each Mongo collection in which at least one document has a field named `id`.\n",
"\n",
"> **TODO:** Documents in the [`functional_annotation_agg` collection](https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/) do not have a field named `id`, and so will not be included here. Document the author's rationale for omitting it.\n",
@aclum (Contributor) commented May 29, 2024

We need to include the functional_annotation_agg collection and the metap_gene_function_aggregation collection (not defined in the schema).
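For context, a minimal pymongo sketch of the kind of collection-selection step that cell describes, extended per the comment above to force-include the two aggregation collections. The connection details are assumptions, and this is not the notebook's actual code.

```python
from pymongo import MongoClient

# Assumed connection details; adjust to your local Mongo instance.
db = MongoClient("mongodb://localhost:27017")["nmdc"]

# Collections in which at least one document has an `id` field.
collection_names = [
    name
    for name in db.list_collection_names()
    if db[name].find_one({"id": {"$exists": True}}) is not None
]

# Per the review comment, also include the aggregation collections,
# whose documents do not carry an `id` field.
for extra in ("functional_annotation_agg", "metap_gene_function_aggregation"):
    if extra in db.list_collection_names() and extra not in collection_names:
        collection_names.append(extra)

print(collection_names)
```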

Quoted notebook code cell (comment):

> # check these slots for null values for all docs in collection_names
Contributor:

Is this because the nulls cause problems for RDF generation?
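For illustration, a self-contained pymongo sketch of a null-value check like the one that cell performs; the slot names and connection details are placeholders, not the notebook's own values.

```python
from pymongo import MongoClient

# Assumed connection details; adjust to your local Mongo instance.
db = MongoClient("mongodb://localhost:27017")["nmdc"]

# Hypothetical slot names; the notebook checks its own list of slots.
slots_to_check = ["part_of", "has_input", "has_output"]

for name in db.list_collection_names():
    for slot in slots_to_check:
        # Count documents where the slot is present but explicitly null.
        n = db[name].count_documents({slot: {"$type": "null"}})
        if n:
            print(f"{name}.{slot}: {n} documents with a null value")
```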

" # and insert the resulting document into the `alldocs` collection. Note that we are not\n",
" # relying on the original value of the `type` field, since it's unreliable (see below).\n",
" \n",
" # NOTE: `type` is currently a string, does not exist for all classes, and can have typos. \n",
Contributor:

`type` currently being a string caused a lot of problems with re-IDing. The current recommendation is to infer the type from the collection name until we are on the Berkeley schema.
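A minimal sketch of that recommendation, assuming a hand-maintained mapping from collection name to schema class name; the mapping entries and the `nmdc:` prefix convention shown here are illustrative, not taken from the notebook.

```python
# Illustrative mapping from collection name to schema class name; the real
# mapping would need to cover every collection in the schema.
COLLECTION_TO_CLASS = {
    "biosample_set": "Biosample",
    "study_set": "Study",
    "data_object_set": "DataObject",
}


def with_inferred_type(collection_name: str, doc: dict) -> dict:
    """Return a copy of `doc` whose `type` is derived from the collection name,
    ignoring whatever (possibly typo-laden) `type` string the document carried."""
    new_doc = dict(doc)
    new_doc["type"] = f"nmdc:{COLLECTION_TO_CLASS[collection_name]}"
    return new_doc


# Example usage with a placeholder document:
print(with_inferred_type("biosample_set", {"id": "example-id"}))
```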

"id": "d4abec53",
"metadata": {},
"source": [
"Spot check one of those errors."
Contributor:

This was an issue with the range in the schema that has since been fixed.

@aclum (Contributor) commented May 29, 2024

@PeopleMakeCulture it would be very useful to re-run the notebook on Thursday with the version of the schema that is in main. We fixed some of the issues with range, and the data will be re-ID'd, so there won't be any duplicate id values across collections.

@PeopleMakeCulture (Collaborator, Author)

  • The referential-integrity-check notebook has been moved to "Materialize mongo alldocs collection" (#550), since it documents the process of generating an alldocs collection.

  • The RDF-gen notebook does not need to be part of the main branch, since it is not in use.

Closing this PR.
