Ingest in transaction #822
base: main
Conversation
The really big diff is mainly due to e55995a. There I'm making all foreign keys deferrable. This allows FK checks to wait until the transaction commits.
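A deferrable foreign key is only checked at `COMMIT`, so a transaction may pass through states that violate it, e.g. clearing tables in any order. A minimal sketch of that behavior using the stdlib `sqlite3` module (the `parent`/`child` tables are illustrative, not from this repo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child (id INTEGER PRIMARY KEY, "
    "parent_id INTEGER REFERENCES parent(id) DEFERRABLE INITIALLY DEFERRED)"
)
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (1, 1)")
conn.commit()

with conn:  # commits on success, rolls back on error
    conn.execute("DELETE FROM parent")  # child row dangles here...
    conn.execute("DELETE FROM child")   # ...but the FK holds again by COMMIT
```

With an immediate (non-deferrable) constraint, the first `DELETE` would fail on the spot; deferring the check is what lets the ingest clear every table inside one transaction.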
```python
with session_maker.begin() as db:  # type: ignore
    database.get_ingest_lock(db)
    alembic_cfg = Config(str(Path(__file__).parent / "alembic.ini"))
    alembic_cfg.set_main_option("script_location", str(Path(__file__).parent / "migrations"))
    alembic_cfg.set_main_option("sqlalchemy.url", database_uri)
    alembic_cfg.attributes["configure_logger"] = True
    command.upgrade(alembic_cfg, "head")
```
This invariably runs the migration, which is a no-op if no migrations are required. However, the migration command must still acquire the ingest lock.
Since our container invokes the `nmdc-server migrate` command on startup, a currently running ingestion will block the container from even starting: the migrate command throws an error if the lock cannot be acquired.
This is desirable behavior if migrations must be applied, but bad if the migration would be a no-op. We need to change this to apply migrations only when required. Luckily, this has been implemented in the past.
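One way to make the migrate step conditional (a sketch against Alembic's public API, not the actual nmdc-server fix; the function names and wiring here are assumptions): compare the database's current revision to the newest script and skip the upgrade, and therefore the lock, when they already match.

```python
from pathlib import Path


def is_up_to_date(current_revision, head_revision):
    """True when the DB already sits at the newest migration script."""
    return current_revision is not None and current_revision == head_revision


def upgrade_if_needed(database_uri: str, alembic_ini: Path) -> None:
    # Local imports keep this sketch importable without alembic installed.
    from alembic import command
    from alembic.config import Config
    from alembic.runtime.migration import MigrationContext
    from alembic.script import ScriptDirectory
    from sqlalchemy import create_engine

    cfg = Config(str(alembic_ini))
    head = ScriptDirectory.from_config(cfg).get_current_head()
    engine = create_engine(database_uri)
    with engine.connect() as conn:
        current = MigrationContext.configure(conn).get_current_revision()
    if is_up_to_date(current, head):
        return  # no-op: don't acquire the ingest lock at container startup
    command.upgrade(cfg, "head")
```

Checking the revision first means a running ingestion only blocks startup when a real migration is pending.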
Since this is a relatively significant update to the ingest process, I'd like to request that we discuss the change at the next Kitware / NMDC sync before rolling it into production.
This PR changes the ingestion to work inside of a transaction on a single database rather than via an A/B database.
Intro
Previously: We had two databases. The web server operated on one database while ingestion clear-then-loaded the other. Persisted data was copied from the live database to the ingest database. After a successful ingest we'd rotate the databases and re-deploy.
With this PR: There is only one database. The ingestion opens a transaction and clears the database without locking reads. If ingestion succeeds, its changes become visible to the web server without further intervention; if it fails or disconnects, the changes are discarded.
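The transactional flow can be sketched with the stdlib `sqlite3` module (the `samples` table and its columns are hypothetical stand-ins for the real schema): clear and load happen in one transaction, so a failure discards the partial ingest and readers keep seeing the old data.

```python
import sqlite3


def ingest(conn, rows):
    """Clear-then-load inside a single transaction.

    The connection context manager commits on success and rolls back on
    any exception, so a failed ingest leaves the previous data untouched.
    """
    with conn:
        conn.execute("DELETE FROM samples")  # hypothetical table
        conn.executemany("INSERT INTO samples (id, name) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
ingest(conn, [(1, "old")])

try:
    ingest(conn, [(2, None)])  # NOT NULL violation: the transaction rolls back
except sqlite3.IntegrityError:
    pass

# The failed ingest was discarded; the previous data is still visible.
rows = conn.execute("SELECT id, name FROM samples").fetchall()
```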
Goals
The long-term goal of this change is to persist the `mga_gene_function` table without a full ingest. Re-creating the `mga_gene_function` table takes 48+ hours. While this PR still clears the `mga_gene_function` table, it lays the groundwork necessary to build a solution that persists this table without a full ingest. This would allow the table to be stale while other data refreshes (a partial ingest only takes ~1 hour). Work is still needed to selectively drop rows in the `mga_gene_function` table that violate FK constraints in the presence of altered dependent data.
Cons
This approach, however, has a con: database "bloat" is likely to accrue. With the A/B databases, tables were truncated, so all indexes were rebuilt and no dead tuples remained after ingestion. Now we `DELETE FROM` tables, which leaves dead tuples behind. Conventional tools can combat this: the scheduled `VACUUM` that's already active will take care of a lot of it. A `VACUUM FULL` achieves the same result as the A/B approach, but it locks reads during its process. There's also pg_repack, which can effectively `VACUUM FULL` without the lock.