Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(pageserver): make barrier waiting for deletion queue #9796

Closed
wants to merge 10 commits into from

Conversation

skyzh
Copy link
Member

@skyzh skyzh commented Nov 18, 2024

Problem

Follow up of #9682, that patch didn't fully address the problem: what if shutdown fails due to whatever reason and then we reattach the tenant? Then we will still remove the future layer. The underlying problem is that the fix for #5878 gets voided because of the generation optimizations. We should always wait for deletions for the barrier.

Summary of changes

  • Add a test case to reproduce the behavior (by changing the original test case to attach the same generation).
  • Ensure all uploads after the barrier happen after all deletions before the barrier finish.

@skyzh skyzh requested a review from a team as a code owner November 18, 2024 22:21
@skyzh skyzh requested a review from yliang412 November 18, 2024 22:21
@skyzh skyzh marked this pull request as draft November 18, 2024 22:25
@skyzh skyzh removed the request for review from yliang412 November 18, 2024 22:26
@skyzh skyzh force-pushed the skyzh/fix-barrier branch from e6e5b27 to aa2649c Compare November 18, 2024 22:44
@skyzh skyzh requested review from jcsp and problame November 18, 2024 22:44
@skyzh skyzh marked this pull request as ready for review November 18, 2024 22:44
@@ -1242,19 +1246,20 @@ impl RemoteTimelineClient {
Ok(())
}

pub(crate) fn schedule_barrier(self: &Arc<Self>) -> anyhow::Result<()> {
pub(crate) fn schedule_barrier(self: &Arc<Self>, initial_barrier: bool) -> anyhow::Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to have schedule_barrier_initial instead of adding a boolean parameter everywhere.

)
.await
.map_err(|e| anyhow::anyhow!(e)),
// Barrier flushes up the deletion queue. Usually, we don't wait until deletion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Barrier flushes up the deletion queue. Usually, we don't wait until deletion
// Barrier flushes the deletion queue. Usually, we don't wait until deletion

Copy link

github-actions bot commented Nov 18, 2024

5544 tests run: 5210 passed, 108 failed, 226 skipped (full report)


Failures on Postgres 17

Failures on Postgres 16

Failures on Postgres 15

Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_emergency_mode[release-pg14] or test_emergency_mode[release-pg14] or test_broker[release-pg14] or test_broker[release-pg14] or test_wal_removal[release-pg14-True] or test_wal_removal[release-pg14-True] or test_wal_removal[release-pg14-False] or test_wal_removal[release-pg14-False] or test_pull_timeline_gc[release-pg14] or test_pull_timeline_gc[release-pg14] or test_pull_timeline_while_evicted[release-pg14] or test_pull_timeline_while_evicted[release-pg14] or test_s3_eviction[release-pg14-0.0-False] or test_s3_eviction[release-pg14-0.0-False] or test_s3_eviction[release-pg14-0.2-False] or test_s3_eviction[release-pg14-0.2-False] or test_s3_eviction[release-pg14-0.0-True] or test_s3_eviction[release-pg14-0.0-True] or test_backup_partial_reset[release-pg14] or test_backup_partial_reset[release-pg14] or test_pull_timeline_partial_segment_integrity[release-pg14] or test_pull_timeline_partial_segment_integrity[release-pg14] or test_s3_eviction[release-pg14-0.2-True] or test_s3_eviction[release-pg14-0.2-True] or test_emergency_mode[release-pg15] or test_emergency_mode[release-pg15] or test_wal_removal[release-pg15-False] or test_wal_removal[release-pg15-False] or test_broker[release-pg15] or test_broker[release-pg15] or test_wal_removal[release-pg15-True] or test_wal_removal[release-pg15-True] or test_pull_timeline_while_evicted[release-pg15] or test_pull_timeline_while_evicted[release-pg15] or test_pull_timeline_gc[release-pg15] or test_pull_timeline_gc[release-pg15] or test_s3_eviction[release-pg15-0.0-False] or test_s3_eviction[release-pg15-0.0-False] or test_s3_eviction[release-pg15-0.2-False] or test_s3_eviction[release-pg15-0.2-False] or test_pull_timeline_partial_segment_integrity[release-pg15] or test_pull_timeline_partial_segment_integrity[release-pg15] or test_s3_eviction[release-pg15-0.0-True] or test_s3_eviction[release-pg15-0.0-True] or test_backup_partial_reset[release-pg15] or test_backup_partial_reset[release-pg15] or test_s3_eviction[release-pg15-0.2-True] or test_s3_eviction[release-pg15-0.2-True] or test_emergency_mode[release-pg16] or test_emergency_mode[release-pg16] or test_broker[release-pg16] or test_broker[release-pg16] or test_wal_removal[release-pg16-False] or test_wal_removal[release-pg16-False] or test_wal_removal[release-pg16-True] or test_wal_removal[release-pg16-True] or test_pull_timeline_gc[release-pg16] or test_pull_timeline_gc[release-pg16] or test_pull_timeline_while_evicted[release-pg16] or test_pull_timeline_while_evicted[release-pg16] or test_s3_eviction[release-pg16-0.2-True] or test_s3_eviction[release-pg16-0.2-True] or test_s3_eviction[release-pg16-0.0-True] or test_s3_eviction[release-pg16-0.0-True] or test_pull_timeline_partial_segment_integrity[release-pg16] or test_pull_timeline_partial_segment_integrity[release-pg16] or test_s3_eviction[release-pg16-0.2-False] or test_s3_eviction[release-pg16-0.2-False] or test_backup_partial_reset[release-pg16] or test_backup_partial_reset[release-pg16] or test_s3_eviction[release-pg16-0.0-False] or test_s3_eviction[release-pg16-0.0-False] or test_emergency_mode[release-pg17] or test_emergency_mode[release-pg17] or test_emergency_mode[debug-pg17] or test_pageserver_small_inmemory_layers[debug-pg17-True] or test_wal_removal[release-pg17-True] or test_wal_removal[release-pg17-True] or test_wal_removal[debug-pg17-True] or test_wal_removal[release-pg17-False] or test_wal_removal[release-pg17-False] or test_wal_removal[debug-pg17-False] or test_pull_timeline_gc[release-pg17] or test_pull_timeline_gc[release-pg17] or test_pull_timeline_gc[debug-pg17] or test_pull_timeline_while_evicted[release-pg17] or test_pull_timeline_while_evicted[release-pg17] or test_pull_timeline_while_evicted[debug-pg17] or test_broker[release-pg17] or test_broker[release-pg17] or test_broker[debug-pg17] or test_s3_eviction[release-pg17-0.0-False] or test_s3_eviction[release-pg17-0.0-False] or test_s3_eviction[debug-pg17-0.0-False] or test_pull_timeline_partial_segment_integrity[release-pg17] or test_pull_timeline_partial_segment_integrity[release-pg17] or test_backup_partial_reset[release-pg17] or test_backup_partial_reset[release-pg17] or test_backup_partial_reset[debug-pg17] or test_s3_eviction[release-pg17-0.0-True] or test_s3_eviction[release-pg17-0.0-True] or test_s3_eviction[debug-pg17-0.0-True] or test_s3_eviction[release-pg17-0.2-True] or test_s3_eviction[release-pg17-0.2-True] or test_s3_eviction[debug-pg17-0.2-True] or test_s3_eviction[release-pg17-0.2-False] or test_s3_eviction[release-pg17-0.2-False] or test_s3_eviction[debug-pg17-0.2-False]"
Flaky tests (2)

Postgres 15

Test coverage report is not available

The comment gets automatically updated with the latest test results
499105d at 2024-11-21T20:06:24.118Z :recycle:

@problame problame removed their request for review November 19, 2024 11:28
@problame
Copy link
Contributor

Removing myself from review, Arpad has more context

@skyzh
Copy link
Member Author

skyzh commented Nov 19, 2024

test failure around test_timeline_retain_lsn seems like a race condition: when we are still preloading the timeline and scanning the files, deletion executor removes all files scheduled for deletion before the pageserver restart (how is this possible??); preload first lists the files and then do file_metadata on each of them, in-between the file gets removed by the deletion executor, therefore reporting file not found error.

Given the the behavior of a barrier is to trigger deletion, this makes me wonder (1) is our current implementation correct? why we will delete files before preload completes? (2) is this patch correct? i start doubting whether it's a good idea to have such waiting for deletion queue semantics for the initial barrier...

@skyzh
Copy link
Member Author

skyzh commented Nov 19, 2024

test_timeline_retain_lsn tests should pass now, will look into timeouts on emergency mode later; this patch is a little bit desperate to work on, context switching to other things for now :(

@skyzh skyzh force-pushed the skyzh/fix-barrier branch from ec9fcbd to cdde254 Compare November 19, 2024 16:58
/// The boolean value indicates whether the barrier is an initial barrier scheduled
/// at timeline load -- if yes, we will need to wait for all deletions to be completed
/// before the next upload.
Barrier(tokio::sync::watch::Sender<()>, bool),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a descriptive enum rather than a bool here, to make it easier to read & harder to typo

Copy link
Collaborator

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR convinces me that:
A) this change improves safety 👍
B) we should later address the underlying complexity somehow, perhaps by modifying the ancestor detach to include a generation increment (the request flows through the storage controller so this is possible)

@skyzh
Copy link
Member Author

skyzh commented Nov 20, 2024

2c4829c fixes the panic in test_detach_while_attaching

d710f00 adds a fastpath to delete executor flush so that it won't validate the deletion batch if there are no pending lists.

@skyzh
Copy link
Member Author

skyzh commented Nov 21, 2024

d710f00 changed the emergency mode behavior not to wait for uploads

@skyzh
Copy link
Member Author

skyzh commented Nov 21, 2024

the new patch #9844 to fully address the problem in the remote client

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants