Tell whether a database has changed since last backup #8

Open
davenquinn opened this issue Nov 2, 2021 · 2 comments
@davenquinn
Member

We have a lot of databases we'd like to keep backed up, but many of them don't change often. It would be nice if there were a deterministic, or at least reasonably accurate, way to determine whether a database has changed recently. This would keep us from filling up our backup space with near-identical database dumps.

Right now, we compute md5 hashes of final database backups. However, pg_dump -Fc ("custom format" backups) embeds metadata, such as a creation timestamp, that differs between dumps, so the hashes aren't stable even for backups generated seconds apart. Text dumps (without -Fc) can have stable hashes, but you lose significant restore capabilities (for instance, the ability to restore only some tables, change table ownership, or manipulate permissions). I'd like to keep all backups in custom format, if possible, to simplify restoration.

One approach would be to generate a text dump locally from each custom-format dump created, for the purposes of hashing. However, this would effectively dump each database twice, which would be insane for large databases.

Another option would be to use the database system's internal functions to figure out the last transaction ID, commit time, or a similar marker, and generate a hash from that. I think there are solutions for this (especially for recent PostgreSQL), but they may not be available in default configurations. The big advantage of figuring something like this out is that such a pre-backup check would be extremely quick and would save us a lot of extra work. It would have to avoid false negatives, though: the check must never report "unchanged" when data has actually changed.
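
For what it's worth, a minimal sketch of such a quick pre-backup check, assuming we settle for a heuristic built from the cumulative counters in pg_stat_database (the choice of counters here is an assumption, not a settled design):

-- Heuristic change fingerprint for the current database, built from the
-- cumulative statistics counters in pg_stat_database. If the hash differs
-- from the one recorded at the last backup, something has (probably) changed.
-- Including xact_commit errs toward false positives (extra backups) rather
-- than false negatives (missed changes).
select md5(concat_ws('|',
    xact_commit, tup_inserted, tup_updated, tup_deleted, stats_reset))
  as change_fingerprint
from pg_stat_database
where datname = current_database();

The statistics collector reports these counters with a small delay, and they reset whenever statistics are reset, so this would be a cheap first-pass filter rather than a guarantee.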

@davenquinn
Member Author

davenquinn commented Nov 3, 2021

A few additional thoughts here:

  1. Using the --section flag with the custom format still does not create stable hashes, e.g.,
> dump-macrostrat
Dumping section pre-data
f1ad34952762f383f94737760510eaf5  -
Dumping section data
115d1164a745a08ff66a42f759b3f3ae  -
Dumping section post-data
d57e6a976cae674dc1ecb0853d60caa4  -
  2. Tracking commit timestamps may be the way to go here (see the third answer in this Stack Overflow post), but it requires a configuration setting (track_commit_timestamp) to be enabled at the database cluster level. A sketch of this check follows the list.

  3. Getting the size of all tables before dumping could be a good approach, although changes could go undetected if rows are modified in place without changing total table size:

select sum(pg_table_size(format('%I.%I', schemaname, tablename)::regclass))
from pg_tables;
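
For point 2, a minimal sketch of the commit-timestamp check, assuming track_commit_timestamp = on has been set at the cluster level (it defaults to off, and changing it requires a server restart):

-- Works only when track_commit_timestamp = on (PostgreSQL 9.5+).
show track_commit_timestamp;

-- Transaction ID and commit time of the most recently committed transaction;
-- if the timestamp predates the last backup, nothing has been written since.
select * from pg_last_committed_xact();

Note that pg_last_committed_xact() reports the newest commit across the whole cluster, not per database, so on a shared cluster it would need to be combined with a per-database signal.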

@davenquinn
Member Author

davenquinn commented Mar 7, 2022

One option here is that we could allow a user-specified query per database that checks whether the important things have changed internally. That result set could be used for hash generation, and the database backup skipped if the hash matches the one recorded for an already-existing backup.
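
As an illustration only, such a per-database check query might hash a few cheap aggregates over the tables that matter; the table and column names below are hypothetical placeholders, not part of any real schema:

-- Hypothetical "change fingerprint" query supplied per database; the backup
-- script would store this hash next to each dump and skip pg_dump when it
-- hasn't changed. important_table and updated_at are placeholder names.
select md5(concat_ws('|',
    (select count(*) from important_table),
    (select max(updated_at) from important_table)))
  as change_fingerprint;

The backup script would run the configured query, compare the resulting hash with the one stored alongside the most recent dump, and only invoke pg_dump when they differ.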
