Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metaissue:referential integrity checks via runtime API is needed #318

Open
aclum opened this issue Sep 28, 2023 · 10 comments · May be fixed by #644
Open

Metaissue:referential integrity checks via runtime API is needed #318

aclum opened this issue Sep 28, 2023 · 10 comments · May be fixed by #644
Assignees
Labels

Comments

@aclum
Copy link
Contributor

aclum commented Sep 28, 2023

As part of the re-iding squad we discovered over 7,000 documents in mongo that had issues with referential integrity, that is an id was reference in has_input, has_output, was_informed_by, part_of that did not exist. We've had issues with records not being completely deleted, we discovered some bioscales reads based analysis records that were not deleted as part of the re-iding done for GSP and with study identifiers being referenced in biosample records without the study_set record existing yet. Based on that, we need referential integrity checks on all action types (insertion, deletion, update).

cc @shreddd @mbthornton-lbl @dwinston @eecavanna

@turbomam
Copy link
Member

Let's stay in touch with @cmungall or @pkalita-lbl because there is some enhanced LInkML validation tooling in the works.

@dwinston
Copy link
Collaborator

/cc @PeopleMakeCulture

@PeopleMakeCulture
Copy link
Collaborator

Sounds like this ticket and #401 are the top two priorities coming out of Thursday's db sync. My concerns are:

  1. Making sure we're all aligned on one strategy for validation and referential integrity
  2. Understanding who will be responsible for which parts, which will inform how @dwinston and I prioritize our time

@eecavanna raised the possibility of organizing a validation & referential integrity squad. Would that be the best way forward?

@PeopleMakeCulture PeopleMakeCulture moved this from Bench to Injured List in Polyneme mixset Apr 26, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from Injured List to At bat in Polyneme mixset Apr 30, 2024
@PeopleMakeCulture
Copy link
Collaborator

From Slack:

Additional validation during DB inserts:

  • Validation is focused around the validate_json function in nmdc_runtime.util, which in turn dispatches to an OverlayDB (class in the same module) context manager that wraps the main pymongo.database.Database singleton.
  • We can add stronger validation there. We will also want to refactor nmdc_runtime.api.core.metadata._validate_changesheet to dispatch to an OverlayDB so that the stronger validation covers changesheet submissions as well.
  • I think it also makes sense to add a materialized-from-mongo SPARQL view specifically to aid with internal validation checks, separate from its potential external utility as an exposed endpoint (and I recall that improved queryability was another takeaway from the discussion).

@dwinston to draft a PR to add stronger mongo-change validation to the runtime.

@PeopleMakeCulture
Copy link
Collaborator

PeopleMakeCulture commented May 1, 2024

@PeopleMakeCulture
Copy link
Collaborator

@eecavanna @dwinston Notebooks have been added in #521

@PeopleMakeCulture PeopleMakeCulture moved this from At bat to On base in Polyneme mixset May 28, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from On base to At bat in Polyneme mixset May 28, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from At bat to On base in Polyneme mixset Jun 6, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from On base to At bat in Polyneme mixset Jun 6, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from At bat to On base in Polyneme mixset Jun 13, 2024
dwinston added a commit that referenced this issue Aug 20, 2024
@dwinston dwinston linked a pull request Aug 20, 2024 that will close this issue
9 tasks
@dwinston
Copy link
Collaborator

dwinston commented Oct 11, 2024

These are the relevant entrypoints that need referential integrity added:

Sprint 1

  • POST /metadata/json:validate
  • POST /metadata/json:submit
  • POST /workflows/workflow_executions

Sprint 2

  • POST /metadata/changesheets:submit
  • POST /metadata/changesheets:validate
  • POST /queries:run

Sprint 3

  • dagster job: apply_metadata_in

@PeopleMakeCulture
Copy link
Collaborator

@dwinston what is the num of seconds POST /metadata/json:validate or POST /metadata/json:submit will wait before timing out?

@PeopleMakeCulture
Copy link
Collaborator

These are the relevant entrypoints that need referential integrity added:

Sprint 1

* [ ]  `POST /metadata/json:validate`[ ]  `POST /metadata/json:submit`[ ]  `POST /workflows/workflow_executions`

Sprint 2

* [ ]  `POST /metadata/changesheets:submit`[ ]  `POST /metadata/changesheets:validate`[ ]  `POST /queries:run`

Sprint 3

* [ ]  dagster job: `apply_metadata_in`

@aclum this is our current working timeline, with Sprint 1 being now through the team retreat, and Sprint 2 being the two weeks following the retreat

@aclum
Copy link
Contributor Author

aclum commented Oct 11, 2024

I would like confirmation that there are no more false positives with this implementation, especially b/c there has been drift from what was done in the notebooks, before this check is added to existing endpoints. Would it be possible to run the code that will be implemented on the existing records in mongo? The current expected results is that 33 records mention an id which does not exist.

@PeopleMakeCulture PeopleMakeCulture changed the title referential integrity checks via runtime API is needed [Parent] referential integrity checks via runtime API is needed Oct 30, 2024
@PeopleMakeCulture PeopleMakeCulture changed the title [Parent] referential integrity checks via runtime API is needed Metaissue:referential integrity checks via runtime API is needed Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Front of house
Development

Successfully merging a pull request may close this issue.

4 participants