-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRAFT] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen #521
Conversation
Rendered notebooksHere are links to the notebooks being introduced via this PR, in their rendered form. I'm adding these links here because GitHub's "Files changed" tab shows the notebooks in source code form instead of in rendered form. @eecavanna I've updated these links to point to rendered notebooks on this branch #318 |
TODOs before merging
Non-actionable notes:
|
@eecavanna @turbomam The "Mongo-to-RDF transformer" notebook should now run on other machines. Would either of you care to give it a test run? You can also skip the Docker steps once the RDF is generated and load it to another graph db server. |
Hi Folks, Requesting your review to check that these notebooks:
Please let me know if you run into any issues with the notebooks |
I added some comments and TODOs in a commit just now. I haven't reviewed the "Check referential integrity" section or anything below that yet. I'll continue reviewing this during the week. In the meantime, you can continue making changes to the notebook (e.g. adding explanatory comments). I also haven't tried running it locally yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks great @PeopleMakeCulture!! I'm running the metadata-translation/notebooks/repl_validation_referential_integrity-1715162638.ipynb
notebook as part of my code review and I seem to running into an error in the block of code that is looking to create/materialize the alldocs
collection.
The error I'm running into is this (truncated stacktrace):
ValueError: Unknown argument: quality_control_report = {'status': 'pass'}
I have nmdc_schema==10.2.0
in my virtual environment.
Can you help debug this error?
The |
Looks like I had restored documents from an older version/dump of the Mongo database which is why the block of code that I reported above was erroring out. I can confirm that I was able to successfully restore all documents from the latest dump of Mongo that you sent to me. Thank you for the help @PeopleMakeCulture 😁 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this sparql notebook need to be reviewed? I thought we were using a strictly python approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be ignored for now
"source": [ | ||
"Determine the name of each Mongo collection in which at least one document has a field named `id`.\n", | ||
"\n", | ||
"> **TODO:** Documents in the [`functional_annotation_agg` collection](https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/) do not have a field named `id`, and so will not be included here. Document the author's rationale for omitting it.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to include the functional_annotation_agg collection and the metap_gene_function_aggregation (not defined in the schema)
} | ||
], | ||
"source": [ | ||
"# check these slots for null values for all docs in collection_names\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this b/c the nulls cause problesm for rdf generation?
" # and insert the resulting document into the `alldocs` collection. Note that we are not\n", | ||
" # relying on the original value of the `type` field, since it's unreliable (see below).\n", | ||
" \n", | ||
" # NOTE: `type` is currently a string, does not exist for all classes, and can have typos. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type
currently being a string caused a lot of problems with re-iding. Current recommendation is infer type from collection name until we are on Berkeley schema.
"id": "d4abec53", | ||
"metadata": {}, | ||
"source": [ | ||
"Spot check one of those errors." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an issue with the range in the schema that has since been fixed.
@PeopleMakeCulture it would be very useful to re-run the notebook on Thursday with the version of the schema that is in main, we fixed some of the issues with range and the data will be re-id'd so there won't be an duplicate |
Closing this PR. |
Description
This PR add two notebooks for prototyping validation solutions to the
nmdc-runtime
repoRelated to #318
Type of change
How Has This Been Tested?
Notebooks have been run locally