
[DRAFT] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen #521

Closed
Wants to merge 14 commits.

Conversation

PeopleMakeCulture (Collaborator)

Description

This PR adds two notebooks for prototyping validation solutions to the nmdc-runtime repo.

Related to #318

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Notebooks have been run locally

@eecavanna (Collaborator) commented May 11, 2024

Rendered notebooks

Here are links to the notebooks being introduced via this PR, in their rendered form. I'm adding these links here because GitHub's "Files changed" tab shows the notebooks in source code form instead of in rendered form.

@eecavanna I've updated these links to point to rendered notebooks on this branch #318

@PeopleMakeCulture PeopleMakeCulture changed the title Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen [DRAFT:DO NOT MERGE] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen May 11, 2024
@PeopleMakeCulture (Collaborator, Author) commented May 11, 2024

TODOs before merging

  • remove tmp code
  • add set-up instructions
  • make edits from @aclum
  • pre-cleaning step needs these fields: 1) metagenome_annotation_id, 2) metaproteomic_analysis_id

Non-actionable notes:

  • `type` is not present on all classes, and its value is currently a free-form string that can contain typos. Both of these issues are fixed in the Berkeley schema, but that schema is risky to use at this time.
  • Some IDs in the current Mongo database are not unique across collections; this will be fixed by re-IDing (see the sketch below for one way to detect such duplicates).
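Not part of the notebooks themselves, but for illustration: a minimal pymongo sketch of how one could detect `id` values that are shared across collections. The connection string and database name are assumptions.

```python
from collections import defaultdict

from pymongo import MongoClient

# Assumed connection details; adjust to your local Mongo instance.
db = MongoClient("mongodb://localhost:27017")["nmdc"]

# Map each `id` value to the set of collections it appears in.
id_to_collections = defaultdict(set)
for name in db.list_collection_names():
    for doc in db[name].find({"id": {"$exists": True}}, {"id": 1}):
        id_to_collections[doc["id"]].add(name)

duplicates = {i: sorted(colls) for i, colls in id_to_collections.items() if len(colls) > 1}
print(f"{len(duplicates)} id values appear in more than one collection")
```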

@PeopleMakeCulture (Collaborator, Author)

@eecavanna @turbomam The "Mongo-to-RDF transformer" notebook should now run on other machines. Would either of you care to give it a test run?

Once the RDF has been generated, you can also skip the Docker steps and load it into another graph database server.
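For anyone who wants to inspect or load the generated RDF without the Docker-based server, a minimal sketch using rdflib; the output filename here is a placeholder, not the notebook's actual output path.

```python
from rdflib import Graph

# Placeholder path for the RDF produced by the notebook.
graph = Graph()
graph.parse("nmdc_mongo_export.ttl", format="turtle")
print(f"Loaded {len(graph)} triples")

# From here, the graph could be serialized again or pushed to whatever
# triple store you prefer, instead of the Docker-based one used in the notebook.
```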

@dwinston dwinston marked this pull request as draft May 16, 2024 19:37
@PeopleMakeCulture PeopleMakeCulture changed the title [DRAFT:DO NOT MERGE] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen May 16, 2024
@PeopleMakeCulture PeopleMakeCulture changed the title Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen [DRAFT] Merge jupyter notebooks for 1) mongo validation and ref integrity check && 2) RDF gen May 16, 2024
@PeopleMakeCulture (Collaborator, Author) commented May 16, 2024

Hi Folks,

Requesting your review to check that these notebooks:

  1. contain the correct logic
  2. contain adequate documentation
  3. can be run in your dev environment

Please let me know if you run into any issues with the notebooks.

@eecavanna (Collaborator)

I added some comments and TODOs in a commit just now. I haven't reviewed the "Check referential integrity" section or anything below that yet. I'll continue reviewing this during the week. In the meantime, you can continue making changes to the notebook (e.g. adding explanatory comments). I also haven't tried running it locally yet.

@sujaypatil96 (Collaborator) left a comment

This code looks great @PeopleMakeCulture!! I'm running the metadata-translation/notebooks/repl_validation_referential_integrity-1715162638.ipynb notebook as part of my code review, and I seem to be running into an error in the block of code that creates/materializes the alldocs collection.

The error I'm running into is this (truncated stacktrace):

ValueError:  Unknown argument: quality_control_report = {'status': 'pass'}

I have nmdc_schema==10.2.0 in my virtual environment.

Can you help debug this error?

@sujaypatil96 (Collaborator)

The alldocs collection that you are creating as part of the notebooks in this PR will be very useful for my work in the NCBI Export squad as well, so I'm looking forward to using it.

@sujaypatil96 (Collaborator) commented May 28, 2024

Looks like I had restored documents from an older version/dump of the Mongo database, which is why the block of code I reported above was erroring out. I can confirm that I was able to successfully restore all documents from the latest dump of Mongo that you sent me.

Thank you for the help @PeopleMakeCulture 😁

Contributor:

Does this SPARQL notebook need to be reviewed? I thought we were using a strictly Python approach?

@PeopleMakeCulture (Collaborator, Author):

It can be ignored for now.

"source": [
"Determine the name of each Mongo collection in which at least one document has a field named `id`.\n",
"\n",
"> **TODO:** Documents in the [`functional_annotation_agg` collection](https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/) do not have a field named `id`, and so will not be included here. Document the author's rationale for omitting it.\n",
@aclum (Contributor) commented May 29, 2024

We need to include the functional_annotation_agg collection and the metap_gene_function_aggregation collection (not defined in the schema).
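For context, a minimal pymongo sketch of the kind of collection-selection step that cell describes, extended per the comment above to force-include the two aggregation collections. The connection details are assumptions, and this is not the notebook's actual code.

```python
from pymongo import MongoClient

# Assumed connection details; adjust to your local Mongo instance.
db = MongoClient("mongodb://localhost:27017")["nmdc"]

# Collections in which at least one document has an `id` field.
collection_names = [
    name
    for name in db.list_collection_names()
    if db[name].find_one({"id": {"$exists": True}}) is not None
]

# Per the review comment, also include the aggregation collections,
# whose documents do not carry an `id` field.
for extra in ("functional_annotation_agg", "metap_gene_function_aggregation"):
    if extra in db.list_collection_names() and extra not in collection_names:
        collection_names.append(extra)

print(collection_names)
```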

Quoted notebook code cell (comment):

> # check these slots for null values for all docs in collection_names
Contributor:

Is this because the nulls cause problems for RDF generation?
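For illustration, a self-contained pymongo sketch of a null-value check like the one that cell performs; the slot names and connection details are placeholders, not the notebook's own values.

```python
from pymongo import MongoClient

# Assumed connection details; adjust to your local Mongo instance.
db = MongoClient("mongodb://localhost:27017")["nmdc"]

# Hypothetical slot names; the notebook checks its own list of slots.
slots_to_check = ["part_of", "has_input", "has_output"]

for name in db.list_collection_names():
    for slot in slots_to_check:
        # Count documents where the slot is present but explicitly null.
        n = db[name].count_documents({slot: {"$type": "null"}})
        if n:
            print(f"{name}.{slot}: {n} documents with a null value")
```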

" # and insert the resulting document into the `alldocs` collection. Note that we are not\n",
" # relying on the original value of the `type` field, since it's unreliable (see below).\n",
" \n",
" # NOTE: `type` is currently a string, does not exist for all classes, and can have typos. \n",
Contributor:

`type` currently being a string caused a lot of problems with re-IDing. The current recommendation is to infer the type from the collection name until we are on the Berkeley schema.
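A minimal sketch of that recommendation, assuming a hand-maintained mapping from collection name to schema class name; the mapping entries and the `nmdc:` prefix convention shown here are illustrative, not taken from the notebook.

```python
# Illustrative mapping from collection name to schema class name; the real
# mapping would need to cover every collection in the schema.
COLLECTION_TO_CLASS = {
    "biosample_set": "Biosample",
    "study_set": "Study",
    "data_object_set": "DataObject",
}


def with_inferred_type(collection_name: str, doc: dict) -> dict:
    """Return a copy of `doc` whose `type` is derived from the collection name,
    ignoring whatever (possibly typo-laden) `type` string the document carried."""
    new_doc = dict(doc)
    new_doc["type"] = f"nmdc:{COLLECTION_TO_CLASS[collection_name]}"
    return new_doc


# Example usage with a placeholder document:
print(with_inferred_type("biosample_set", {"id": "example-id"}))
```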

"id": "d4abec53",
"metadata": {},
"source": [
"Spot check one of those errors."
Contributor:

This was an issue with the range in the schema that has since been fixed.

@aclum (Contributor) commented May 29, 2024

@PeopleMakeCulture it would be very useful to re-run the notebook on Thursday with the version of the schema that is in main. We fixed some of the issues with range, and the data will be re-ID'd, so there won't be any duplicate id values across collections.

@PeopleMakeCulture (Collaborator, Author)

  • The referential-integrity-check notebook has been moved to "Materialize mongo alldocs collection" (#550), since it documents the process of generating an alldocs collection.

  • The RDF-gen notebook does not need to be part of the main branch, since it is not in use.

Closing this PR.
