Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a transaction when applying migrations #1822

Merged
merged 1 commit into from
Jul 2, 2024

Conversation

Kobzol
Copy link
Contributor

@Kobzol Kobzol commented Jun 17, 2024

Recently, migrations in triagebot have been causing some issues. When I checked the code I noticed that transactions weren't used, which caused two issues:

  1. The migration command could have succeeded partially. In one PR, the command contained several SQL queries, where the first one has succeeded, and the rest of them didn't (in fact they could not succeed, because multiple statements are not supported in a single call to execute). This would be rolled back if it was wrapped in a transaction.
  2. It was possible that a migration has happened, but the ID wasn't updated in the DB, for some reason.

This PR wraps each migration in a separate transaction.

CC @apiraino

@Kobzol Kobzol requested a review from Mark-Simulacrum June 17, 2024 14:28
@Kobzol Kobzol force-pushed the db-migrations-tx branch from 60d519b to 6ca524e Compare June 17, 2024 14:29
@apiraino
Copy link
Contributor

apiraino commented Jul 2, 2024

pls can anyone review this? It's blocking me on #1790 and by extension #1753

thanks

@ehuss
Copy link
Contributor

ehuss commented Jul 2, 2024

I'm not quite following the exact issue this is trying to resolve.

If a migration fails, then it should exit without updating the counter. What was the behavior with #1781 with multiple statements? Did it successfully execute the first statement, and then fail?

I'm curious if you considered wrapping the entire for loop (or the entire function) in one transaction instead of doing each migration in a separate transaction. I think it should work either way (a single transaction might be slightly faster?), would just like to know why this particular approach was chosen.

I'd also like to note that the migration code doesn't look like it handles concurrent processes trying to do migrations. Probably not an issue for triagebot, but I wanted to just note it.

@Kobzol
Copy link
Contributor Author

Kobzol commented Jul 2, 2024

Did it successfully execute the first statement, and then fail?

We don't really know. Currently, based on the state of the DB, it looks like it executed the second statement, and then it failed..

The goal is to make the execution of the migration and the increment of the counter atomic. It is true that if the migration fails, then the counter should not be updated (even without a transaction). However, adding a transaction fixes the following two problems:

  • If the migration fails in some non-atomic way (as we had before - multiple statements), the non-atomic part will be reverted. This would have prevented the previous issue where only a part of a migration was applied. So it's not just about the version increment, but about the migration statement itself.
  • If the migration succeeds, but the increment fails, the migration will be reverted. Although increment failures should be rare, they can still happen, and this makes sure that the DB is kept in a consistent state.

I'm curious if you considered wrapping the entire for loop (or the entire function) in one transaction instead of doing each migration in a separate transaction. I think it should work either way (a single transaction might be slightly faster?), would just like to know why this particular approach was chosen.

I did consider that. I don't think that performance is really important here, usually you just execute 1, maybe 2 migrations, the already applied migrations are not being executed anyway, so there is no permanently increasing performance cost. So I wouldn't personally do it because of performance reasons. My reasoning was that if you have a good migration and then a bad migration, you might want to at least go as far as you can (hence separate migrations). On the other hand, if you apply multiple migrations together, you probably added them together in the same PR, and therefore they are probably related to one another, and therefore if one fails, you might later want to modify all of the newly added migrations in another PR with a fix. So perhaps wrapping everything in one larger transaction makes more sense.

I'd also like to note that the migration code doesn't look like it handles concurrent processes trying to do migrations. Probably not an issue for triagebot, but I wanted to just note it.

You're right. I'm not sure if we deal with this in any of our tools, this is non-trivial to resolve, AFAIK.

@ehuss
Copy link
Contributor

ehuss commented Jul 2, 2024

We don't really know. Currently, based on the state of the DB, it looks like it executed the second statement, and then it failed..

What information do you have that makes it seem that way?

From what I understand:

  1. Migration counter is 11.
  2. Workflow for tracking PRs assignment #1773 Feb 23 added two new migrations.
    • Added table review_prefs
    • Added index review_prefs_user_id
    • Migration counter is now 13.
  3. Add webhook handler to update PR workload queues #1781 modified an existing migration (to add CREATE EXTENSION), and thus the migration did not run. Also, the migration is now invalid since it contains multiple statements.
  4. Assign PR assignment based on work queue #1786
    • Added column max_assigned_prs to review_prefs
    • Migration counter is now 14.
  5. PR assignment based on work queue availability (take #2) #1793 (take 2) tried to fix the migration by splitting the incorrect formatting in Add webhook handler to update PR workload queues #1781. However, this did not work as expected. It just shifted the migrations. The CREATE EXTENSION migration never runs.
    • The ALTER TABLE migration runs again, but does nothing.
    • Migration counter is now 15.
  6. Revert last two prs #1796 (revert last two PRs) Reverts PR assignment based on work queue availability (take #2) #1793.
    • Migration counter is still 15 (unless Mark manually modified it?). It is now out of sync with the source code, which only has 14 migrations. Also, the CREATE EXTENSION migration has never run.

Nowhere along this does it seem like the lack of transactions is the problem.

@Kobzol
Copy link
Contributor Author

Kobzol commented Jul 2, 2024

Thank you for the detailed analysis! ❤️ The counter was indeed 15, Mark now modified it to 14, and the last migration is in effect (the index exists), but the penultimate one isn't (I asked Mark to enable the intarray extension manually though).

In these very specific circumstances, it probably wouldn't have helped to have transactions, but if the multistatement migration was added as a new migration, instead of being modified (and this can happen again in the future), transactions would prevent the error.

But having transactions for migrations is a good idea nevertheless, regardless if it would prevent issues in this specific case.

@ehuss
Copy link
Contributor

ehuss commented Jul 2, 2024

if the multistatement migration was added as a new migration, instead of being modified (and this can happen again in the future), transactions would prevent the error.

Unless I'm misunderstanding, I don't think it would. If you try to run a migration with multiple statements, it would error with ERROR: cannot insert multiple commands into a prepared statement, without running any of the statements. Triagebot would then immediately exit without updating the counter.

@Kobzol
Copy link
Contributor Author

Kobzol commented Jul 2, 2024

I see. I thought it would execute the first statement only. Ok, so it wouldn't help in this case. It's still a good idea to do it though.

Copy link
Contributor

@ehuss ehuss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go ahead and merge, even though I don't think this will fix any issues we've had. It is appropriate to do changes like this in a transaction, though it will only help if the triagebot executable is killed in-between the two statements, and I think the likelihood of that is near zero.

I did a little bit of scanning to see which kinds of statements postgresql supports in transactions. There are a few that aren't supported (around 11 I could find), but none jump out to me to be ones we would likely ever use.

@ehuss ehuss merged commit 989e03c into rust-lang:master Jul 2, 2024
2 checks passed
@Kobzol Kobzol deleted the db-migrations-tx branch July 2, 2024 17:28
@Kobzol
Copy link
Contributor Author

Kobzol commented Jul 2, 2024

Thanks! I assume you meant "doesn't support in transactions"?

@ehuss
Copy link
Contributor

ehuss commented Jul 2, 2024

Correct, I updated the comment. There are around 11 that aren't supported ("create db", "vacuum", etc.).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants