Ingest in transaction #822
base: main
Conversation
The really big diff is mainly due to e55995a. There I'm making all foreign keys deferrable. This allows FK checks to wait until the transaction commits.
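A deferrable foreign key is only checked at `COMMIT`, so a transaction may pass through states that violate it, e.g. clearing tables in any order. A minimal sketch of that behavior using the stdlib `sqlite3` module (the `parent`/`child` tables are illustrative, not from this repo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child (id INTEGER PRIMARY KEY, "
    "parent_id INTEGER REFERENCES parent(id) DEFERRABLE INITIALLY DEFERRED)"
)
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (1, 1)")
conn.commit()

with conn:  # commits on success, rolls back on error
    conn.execute("DELETE FROM parent")  # child row dangles here...
    conn.execute("DELETE FROM child")   # ...but the FK holds again by COMMIT
```

With an immediate (non-deferrable) constraint, the first `DELETE` would fail on the spot; deferring the check is what lets the ingest clear every table inside one transaction.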
```python
with session_maker.begin() as db:  # type: ignore
    database.get_ingest_lock(db)
    alembic_cfg = Config(str(Path(__file__).parent / "alembic.ini"))
    alembic_cfg.set_main_option("script_location", str(Path(__file__).parent / "migrations"))
    alembic_cfg.set_main_option("sqlalchemy.url", database_uri)
    alembic_cfg.attributes["configure_logger"] = True
    command.upgrade(alembic_cfg, "head")
```
This invariably runs the migration, which is a no-op if no migrations are required. However, the migration command must still acquire the ingest lock.
Since our container invokes the `nmdc-server migrate` command on startup, a currently running ingestion will block the container from even starting: the migrate command throws an error if the lock cannot be acquired.
This is desirable behavior if migrations must be applied, but bad if the migration would be a no-op. We need to change this to apply migrations only when required. Luckily, this has been implemented in the past.
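One way to make the migrate step conditional (a sketch against Alembic's public API, not the actual nmdc-server fix; the function names and wiring here are assumptions): compare the database's current revision to the newest script and skip the upgrade, and therefore the lock, when they already match.

```python
from pathlib import Path


def is_up_to_date(current_revision, head_revision):
    """True when the DB already sits at the newest migration script."""
    return current_revision is not None and current_revision == head_revision


def upgrade_if_needed(database_uri: str, alembic_ini: Path) -> None:
    # Local imports keep this sketch importable without alembic installed.
    from alembic import command
    from alembic.config import Config
    from alembic.runtime.migration import MigrationContext
    from alembic.script import ScriptDirectory
    from sqlalchemy import create_engine

    cfg = Config(str(alembic_ini))
    head = ScriptDirectory.from_config(cfg).get_current_head()
    engine = create_engine(database_uri)
    with engine.connect() as conn:
        current = MigrationContext.configure(conn).get_current_revision()
    if is_up_to_date(current, head):
        return  # no-op: don't acquire the ingest lock at container startup
    command.upgrade(cfg, "head")
```

Checking the revision first means a running ingestion only blocks startup when a real migration is pending.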
Since this is a relatively significant update to the ingest process, I'd like to request that we discuss the change at the next Kitware / NMDC sync before rolling it into production.
This PR changes the ingestion to work inside of a transaction on a single database rather than via an A/B database.
Intro
Previously: We had two databases. The web server operated on one database while ingestion clear-then-loaded the other. Persisted data was copied from the live database to the ingest database. After a successful ingest we'd rotate the databases and re-deploy.
With this PR: There is only one database. The ingestion opens a transaction and clears the database without locking reads. If ingestion succeeds, its changes become visible to the web server without further intervention; if it fails or disconnects, the changes are discarded.
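The transactional flow can be sketched with the stdlib `sqlite3` module (the `samples` table and its columns are hypothetical stand-ins for the real schema): clear and load happen in one transaction, so a failure discards the partial ingest and readers keep seeing the old data.

```python
import sqlite3


def ingest(conn, rows):
    """Clear-then-load inside a single transaction.

    The connection context manager commits on success and rolls back on
    any exception, so a failed ingest leaves the previous data untouched.
    """
    with conn:
        conn.execute("DELETE FROM samples")  # hypothetical table
        conn.executemany("INSERT INTO samples (id, name) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
ingest(conn, [(1, "old")])

try:
    ingest(conn, [(2, None)])  # NOT NULL violation: the transaction rolls back
except sqlite3.IntegrityError:
    pass

# The failed ingest was discarded; the previous data is still visible.
rows = conn.execute("SELECT id, name FROM samples").fetchall()
```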
Goals
The long-term goal of this change is to persist the `mga_gene_function` table without a full ingest. Re-creating the `mga_gene_function` table takes 48+ hours. While this PR still clears the `mga_gene_function` table, it lays the groundwork necessary to build a solution that persists this table without a full ingest. This would allow the table to be stale while other data refreshes (a partial ingest only takes ~1 hour). Work is still needed to selectively drop rows in the `mga_gene_function` table that violate FK constraints in the presence of altered dependent data.
Cons
This approach, however, has a con: database "bloat" is likely to accrue. With the A/B databases, tables were truncated, so all indexes were rebuilt and no dead tuples remained after ingestion. Now we `DELETE FROM` tables, which leaves dead tuples behind. Conventional tools can combat this: the scheduled `VACUUM` that's already active will take care of a lot of it. A `VACUUM FULL` achieves the same result as the A/B approach, but it locks reads during its process. There's also pg_repack, which can effectively `VACUUM FULL` without the lock.