fix(pageserver): make barrier waiting for deletion queue #9796
e6e5b27 to aa2649c
@@ -1242,19 +1246,20 @@ impl RemoteTimelineClient {
         Ok(())
     }

-    pub(crate) fn schedule_barrier(self: &Arc<Self>) -> anyhow::Result<()> {
+    pub(crate) fn schedule_barrier(self: &Arc<Self>, initial_barrier: bool) -> anyhow::Result<()> {
It would be better to have schedule_barrier_initial instead of adding a boolean parameter everywhere.
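One way to read this suggestion, as a minimal sketch: existing callers keep calling `schedule_barrier()` with no arguments, and a separate `schedule_barrier_initial()` entry point covers the timeline-load case, with both delegating to a private helper. `Client`, `schedule_barrier_impl`, `scheduled_barriers`, the recorded `Vec<bool>`, and the `Result<(), String>` return type are stand-ins invented for illustration; the real method lives on `RemoteTimelineClient` and returns `anyhow::Result<()>`.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `RemoteTimelineClient`; it only records which
/// kind of barrier was scheduled.
pub struct Client {
    /// `true` entries are initial barriers, `false` entries are regular ones.
    scheduled: Mutex<Vec<bool>>,
}

impl Client {
    pub fn new() -> Arc<Self> {
        Arc::new(Client {
            scheduled: Mutex::new(Vec::new()),
        })
    }

    /// Regular barrier: existing call sites stay unchanged.
    pub fn schedule_barrier(self: &Arc<Self>) -> Result<(), String> {
        self.schedule_barrier_impl(false)
    }

    /// Barrier scheduled at timeline load.
    pub fn schedule_barrier_initial(self: &Arc<Self>) -> Result<(), String> {
        self.schedule_barrier_impl(true)
    }

    /// The boolean stays private to this helper instead of leaking to callers.
    fn schedule_barrier_impl(self: &Arc<Self>, initial_barrier: bool) -> Result<(), String> {
        let mut queue = self.scheduled.lock().map_err(|e| e.to_string())?;
        queue.push(initial_barrier);
        Ok(())
    }

    /// Snapshot of the recorded barriers, for inspection.
    pub fn scheduled_barriers(&self) -> Vec<bool> {
        self.scheduled.lock().expect("lock poisoned").clone()
    }
}
```

With this shape, only the one call site at timeline load changes; every other caller is untouched by the new semantics.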
)
.await
.map_err(|e| anyhow::anyhow!(e)),
// Barrier flushes up the deletion queue. Usually, we don't wait until deletion
- // Barrier flushes up the deletion queue. Usually, we don't wait until deletion
+ // Barrier flushes the deletion queue. Usually, we don't wait until deletion
5544 tests run: 5210 passed, 108 failed, 226 skipped (full report)
Failures on Postgres 17
Failures on Postgres 16
Failures on Postgres 15
Failures on Postgres 14
Flaky tests (2)
Postgres 15
Test coverage report is not available
The comment gets automatically updated with the latest test results
499105d at 2024-11-21T20:06:24.118Z
Removing myself from review; Arpad has more context.
test failure around
Given that the behavior of a barrier is to trigger deletion, this makes me wonder: (1) Is our current implementation correct? Why would we delete files before preload completes? (2) Is this patch correct? I am starting to doubt whether it's a good idea for the initial barrier to have these wait-for-deletion-queue semantics...
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
ec9fcbd to cdde254
/// The boolean value indicates whether the barrier is an initial barrier scheduled
/// at timeline load -- if yes, we will need to wait for all deletions to be completed
/// before the next upload.
Barrier(tokio::sync::watch::Sender<()>, bool),
Let's use a descriptive enum rather than a bool here, to make it easier to read & harder to typo.
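As a rough illustration of the enum suggestion, the `bool` in `Barrier(.., bool)` could become a two-variant enum so that call sites read `BarrierKind::Initial` instead of `true`. The names `BarrierKind`, `waits_for_deletions`, and `must_wait_for_deletions` are hypothetical, and `UploadOp` here is a simplified stand-in: the variant under review also carries a `tokio::sync::watch::Sender<()>`, which is left out to keep the sketch std-only.

```rust
/// Hypothetical replacement for the `bool` in `Barrier(.., bool)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BarrierKind {
    /// Regular barrier: does not wait for the deletion queue.
    Normal,
    /// Scheduled at timeline load: all deletions must complete before the
    /// next upload.
    Initial,
}

impl BarrierKind {
    /// Whether uploads after this barrier must wait for pending deletions.
    pub fn waits_for_deletions(self) -> bool {
        matches!(self, BarrierKind::Initial)
    }
}

/// Simplified stand-in for the upload queue operation type (assumption:
/// only the barrier variant is modeled, without its watch sender).
#[derive(Debug, PartialEq, Eq)]
pub enum UploadOp {
    Barrier(BarrierKind),
}

/// Returns true if the next upload must be held back until deletions finish.
pub fn must_wait_for_deletions(op: &UploadOp) -> bool {
    match op {
        UploadOp::Barrier(kind) => kind.waits_for_deletions(),
    }
}
```

A mistyped or swapped argument then fails to compile or stands out in review, which a bare `true`/`false` would not.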
This PR convinces me that:
A) this change improves safety 👍
B) we should later address the underlying complexity somehow, perhaps by modifying the ancestor detach to include a generation increment (the request flows through the storage controller so this is possible)
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
Signed-off-by: Alex Chi Z <[email protected]>
d710f00 changed the emergency mode behavior not to wait for uploads
Signed-off-by: Alex Chi Z <[email protected]>
the new patch #9844 to fully address the problem in the remote client
Problem
Follow-up to #9682. That patch didn't fully address the problem: what if shutdown fails for whatever reason and we then reattach the tenant? In that case we would still remove the future layer. The underlying problem is that the fix for #5878 is voided by the generation optimizations. The barrier should always wait for deletions.
Summary of changes