fix(pageserver): make barrier waiting for deletion queue #9796

skyzh · 2024-11-18T22:21:49Z

Problem

Follow-up to #9682. That patch didn't fully address the problem: what if shutdown fails for whatever reason and we then reattach the tenant? We would still remove the future layer. The underlying problem is that the fix for #5878 is voided by the generation optimizations. The barrier should always wait for deletions.

Summary of changes

Add a test case to reproduce the behavior (by changing the original test case to attach the same generation).
Ensure that all uploads scheduled after the barrier run only after all deletions scheduled before the barrier have finished.
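
The ordering requirement can be illustrated with a toy model. This is a minimal sketch and not the pageserver code: `Op`, `run`, and the in-memory `remote` set are hypothetical stand-ins, and a plain `Vec` plays the role of the deletion queue.

```rust
use std::collections::BTreeSet;

/// Hypothetical operations in a simplified upload queue.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Upload(&'static str),
    Delete(&'static str),
    Barrier,
}

/// Apply `ops` to a toy remote store and return the layers that remain.
/// Deletions are deferred into `pending`; a barrier flushes them only when
/// `barrier_flushes_deletions` is set.
fn run(ops: &[Op], barrier_flushes_deletions: bool) -> BTreeSet<&'static str> {
    let mut remote = BTreeSet::new();
    let mut pending: Vec<&'static str> = Vec::new();
    for op in ops {
        match op {
            Op::Upload(name) => {
                remote.insert(*name);
            }
            Op::Delete(name) => pending.push(*name),
            Op::Barrier => {
                if barrier_flushes_deletions {
                    for name in pending.drain(..) {
                        remote.remove(name);
                    }
                }
            }
        }
    }
    // Whatever is still deferred executes eventually, after all uploads.
    for name in pending.drain(..) {
        remote.remove(name);
    }
    remote
}
```

With the sequence upload, delete, barrier, upload of the same layer name, the re-uploaded layer survives only when the barrier flushes the deferred deletion first. Otherwise the late deletion removes the new object.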

arpad-m · 2024-11-18T22:47:55Z

pageserver/src/tenant/remote_timeline_client.rs

@@ -1242,19 +1246,20 @@ impl RemoteTimelineClient {
        Ok(())
    }

-    pub(crate) fn schedule_barrier(self: &Arc<Self>) -> anyhow::Result<()> {
+    pub(crate) fn schedule_barrier(self: &Arc<Self>, initial_barrier: bool) -> anyhow::Result<()> {


it would be better to have schedule_barrier_initial instead of adding a boolean parameter everywhere.
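
A rough sketch of what this suggestion could look like. `Client` is a hypothetical stand-in for `RemoteTimelineClient`, the `Result<(), String>` return type replaces `anyhow::Result<()>` to keep the example dependency-free, and `schedule_barrier_inner` is an assumed helper name.

```rust
/// Hypothetical stand-in for RemoteTimelineClient; it only records what was scheduled.
#[derive(Default)]
struct Client {
    scheduled_initial_flags: Vec<bool>,
}

impl Client {
    /// Regular barrier: existing call sites stay unchanged.
    fn schedule_barrier(&mut self) -> Result<(), String> {
        self.schedule_barrier_inner(false)
    }

    /// Initial barrier, scheduled at timeline load.
    fn schedule_barrier_initial(&mut self) -> Result<(), String> {
        self.schedule_barrier_inner(true)
    }

    /// The boolean stays private to this impl instead of leaking to callers.
    fn schedule_barrier_inner(&mut self, initial_barrier: bool) -> Result<(), String> {
        self.scheduled_initial_flags.push(initial_barrier);
        Ok(())
    }
}
```

Only the one call site that needs the initial barrier changes; every other caller keeps calling `schedule_barrier()` as before.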

arpad-m · 2024-11-18T22:57:54Z

pageserver/src/tenant/remote_timeline_client.rs

+                    )
+                    .await
+                    .map_err(|e| anyhow::anyhow!(e)),
+                // Barrier flushes up the deletion queue. Usually, we don't wait until deletion


Suggested change

// Barrier flushes up the deletion queue. Usually, we don't wait until deletion

// Barrier flushes the deletion queue. Usually, we don't wait until deletion

github-actions · 2024-11-18T23:40:13Z

5544 tests run: 5210 passed, 108 failed, 226 skipped (full report)

Failures on Postgres 17

test_emergency_mode: release-x86-64, release-arm64, debug-x86-64
test_pageserver_small_inmemory_layers[True]: debug-x86-64
test_wal_removal[True]: release-x86-64, release-arm64, debug-x86-64
test_wal_removal[False]: release-x86-64, release-arm64, debug-x86-64
test_pull_timeline_gc: release-x86-64, release-arm64, debug-x86-64
test_pull_timeline_while_evicted: release-x86-64, release-arm64, debug-x86-64
test_broker: release-x86-64, release-arm64, debug-x86-64
test_s3_eviction[0.0-False]: release-x86-64, release-arm64, debug-x86-64
test_pull_timeline_partial_segment_integrity: release-x86-64, release-arm64
test_backup_partial_reset: release-x86-64, release-arm64, debug-x86-64
test_s3_eviction[0.0-True]: release-x86-64, release-arm64, debug-x86-64
test_s3_eviction[0.2-True]: release-x86-64, release-arm64, debug-x86-64
test_s3_eviction[0.2-False]: release-x86-64, release-arm64, debug-x86-64

Failures on Postgres 16

test_emergency_mode: release-x86-64, release-arm64
test_broker: release-x86-64, release-arm64
test_wal_removal[False]: release-x86-64, release-arm64
test_wal_removal[True]: release-x86-64, release-arm64
test_pull_timeline_gc: release-x86-64, release-arm64
test_pull_timeline_while_evicted: release-x86-64, release-arm64
test_s3_eviction[0.2-True]: release-x86-64, release-arm64
test_s3_eviction[0.0-True]: release-x86-64, release-arm64
test_pull_timeline_partial_segment_integrity: release-x86-64, release-arm64
test_s3_eviction[0.2-False]: release-x86-64, release-arm64
test_backup_partial_reset: release-x86-64, release-arm64
test_s3_eviction[0.0-False]: release-x86-64, release-arm64

Failures on Postgres 15

test_emergency_mode: release-x86-64, release-arm64
test_wal_removal[False]: release-x86-64, release-arm64
test_broker: release-x86-64, release-arm64
test_wal_removal[True]: release-x86-64, release-arm64
test_pull_timeline_while_evicted: release-x86-64, release-arm64
test_pull_timeline_gc: release-x86-64, release-arm64
test_s3_eviction[0.0-False]: release-x86-64, release-arm64
test_s3_eviction[0.2-False]: release-x86-64, release-arm64
test_pull_timeline_partial_segment_integrity: release-x86-64, release-arm64
test_s3_eviction[0.0-True]: release-x86-64, release-arm64
test_backup_partial_reset: release-x86-64, release-arm64
test_s3_eviction[0.2-True]: release-x86-64, release-arm64

Failures on Postgres 14

test_emergency_mode: release-x86-64, release-arm64
test_broker: release-x86-64, release-arm64
test_wal_removal[True]: release-x86-64, release-arm64
test_wal_removal[False]: release-x86-64, release-arm64
test_pull_timeline_gc: release-x86-64, release-arm64
test_pull_timeline_while_evicted: release-x86-64, release-arm64
test_s3_eviction[0.0-False]: release-x86-64, release-arm64
test_s3_eviction[0.2-False]: release-x86-64, release-arm64
test_s3_eviction[0.0-True]: release-x86-64, release-arm64
test_backup_partial_reset: release-x86-64, release-arm64
test_pull_timeline_partial_segment_integrity: release-x86-64, release-arm64
test_s3_eviction[0.2-True]: release-x86-64, release-arm64

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_emergency_mode[release-pg14] or test_emergency_mode[release-pg14] or test_broker[release-pg14] or test_broker[release-pg14] or test_wal_removal[release-pg14-True] or test_wal_removal[release-pg14-True] or test_wal_removal[release-pg14-False] or test_wal_removal[release-pg14-False] or test_pull_timeline_gc[release-pg14] or test_pull_timeline_gc[release-pg14] or test_pull_timeline_while_evicted[release-pg14] or test_pull_timeline_while_evicted[release-pg14] or test_s3_eviction[release-pg14-0.0-False] or test_s3_eviction[release-pg14-0.0-False] or test_s3_eviction[release-pg14-0.2-False] or test_s3_eviction[release-pg14-0.2-False] or test_s3_eviction[release-pg14-0.0-True] or test_s3_eviction[release-pg14-0.0-True] or test_backup_partial_reset[release-pg14] or test_backup_partial_reset[release-pg14] or test_pull_timeline_partial_segment_integrity[release-pg14] or test_pull_timeline_partial_segment_integrity[release-pg14] or test_s3_eviction[release-pg14-0.2-True] or test_s3_eviction[release-pg14-0.2-True] or test_emergency_mode[release-pg15] or test_emergency_mode[release-pg15] or test_wal_removal[release-pg15-False] or test_wal_removal[release-pg15-False] or test_broker[release-pg15] or test_broker[release-pg15] or test_wal_removal[release-pg15-True] or test_wal_removal[release-pg15-True] or test_pull_timeline_while_evicted[release-pg15] or test_pull_timeline_while_evicted[release-pg15] or test_pull_timeline_gc[release-pg15] or test_pull_timeline_gc[release-pg15] or test_s3_eviction[release-pg15-0.0-False] or test_s3_eviction[release-pg15-0.0-False] or test_s3_eviction[release-pg15-0.2-False] or test_s3_eviction[release-pg15-0.2-False] or test_pull_timeline_partial_segment_integrity[release-pg15] or test_pull_timeline_partial_segment_integrity[release-pg15] or test_s3_eviction[release-pg15-0.0-True] or test_s3_eviction[release-pg15-0.0-True] or test_backup_partial_reset[release-pg15] or test_backup_partial_reset[release-pg15] 
or test_s3_eviction[release-pg15-0.2-True] or test_s3_eviction[release-pg15-0.2-True] or test_emergency_mode[release-pg16] or test_emergency_mode[release-pg16] or test_broker[release-pg16] or test_broker[release-pg16] or test_wal_removal[release-pg16-False] or test_wal_removal[release-pg16-False] or test_wal_removal[release-pg16-True] or test_wal_removal[release-pg16-True] or test_pull_timeline_gc[release-pg16] or test_pull_timeline_gc[release-pg16] or test_pull_timeline_while_evicted[release-pg16] or test_pull_timeline_while_evicted[release-pg16] or test_s3_eviction[release-pg16-0.2-True] or test_s3_eviction[release-pg16-0.2-True] or test_s3_eviction[release-pg16-0.0-True] or test_s3_eviction[release-pg16-0.0-True] or test_pull_timeline_partial_segment_integrity[release-pg16] or test_pull_timeline_partial_segment_integrity[release-pg16] or test_s3_eviction[release-pg16-0.2-False] or test_s3_eviction[release-pg16-0.2-False] or test_backup_partial_reset[release-pg16] or test_backup_partial_reset[release-pg16] or test_s3_eviction[release-pg16-0.0-False] or test_s3_eviction[release-pg16-0.0-False] or test_emergency_mode[release-pg17] or test_emergency_mode[release-pg17] or test_emergency_mode[debug-pg17] or test_pageserver_small_inmemory_layers[debug-pg17-True] or test_wal_removal[release-pg17-True] or test_wal_removal[release-pg17-True] or test_wal_removal[debug-pg17-True] or test_wal_removal[release-pg17-False] or test_wal_removal[release-pg17-False] or test_wal_removal[debug-pg17-False] or test_pull_timeline_gc[release-pg17] or test_pull_timeline_gc[release-pg17] or test_pull_timeline_gc[debug-pg17] or test_pull_timeline_while_evicted[release-pg17] or test_pull_timeline_while_evicted[release-pg17] or test_pull_timeline_while_evicted[debug-pg17] or test_broker[release-pg17] or test_broker[release-pg17] or test_broker[debug-pg17] or test_s3_eviction[release-pg17-0.0-False] or test_s3_eviction[release-pg17-0.0-False] or test_s3_eviction[debug-pg17-0.0-False] or 
test_pull_timeline_partial_segment_integrity[release-pg17] or test_pull_timeline_partial_segment_integrity[release-pg17] or test_backup_partial_reset[release-pg17] or test_backup_partial_reset[release-pg17] or test_backup_partial_reset[debug-pg17] or test_s3_eviction[release-pg17-0.0-True] or test_s3_eviction[release-pg17-0.0-True] or test_s3_eviction[debug-pg17-0.0-True] or test_s3_eviction[release-pg17-0.2-True] or test_s3_eviction[release-pg17-0.2-True] or test_s3_eviction[debug-pg17-0.2-True] or test_s3_eviction[release-pg17-0.2-False] or test_s3_eviction[release-pg17-0.2-False] or test_s3_eviction[debug-pg17-0.2-False]"

Flaky tests (2)

Postgres 15

test_compute_pageserver_connection_stress: release-arm64
test_pull_timeline[True]: release-arm64

Test coverage report is not available

The comment gets automatically updated with the latest test results.
499105d at 2024-11-21T20:06:24.118Z

problame · 2024-11-19T11:28:29Z

Removing myself from review, Arpad has more context

skyzh · 2024-11-19T16:33:14Z

The test failure around test_timeline_retain_lsn looks like a race condition: while we are still preloading the timeline and scanning the files, the deletion executor removes all files that were scheduled for deletion before the pageserver restart (how is this possible??). Preload first lists the files and then calls file_metadata on each of them; in between, a file gets removed by the deletion executor, so preload reports a file-not-found error.

Given that the behavior of a barrier is to trigger deletion, this makes me wonder: (1) Is our current implementation correct? Why do we delete files before preload completes? (2) Is this patch correct? I'm starting to doubt whether it's a good idea to have such wait-for-deletion-queue semantics for the initial barrier...
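
The list-then-stat race described in this comment can be reproduced with plain `std::fs`. The sketch below shows one possible way to tolerate it, by skipping entries that vanish between the two steps. It is an illustration only: `list_files` and `stat_listed` are hypothetical names, and this is not necessarily what the "fix local fs semantics" commit does.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Step 1 of a preload-like scan: list the directory.
fn list_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut paths = Vec::new();
    for entry in fs::read_dir(dir)? {
        paths.push(entry?.path());
    }
    paths.sort();
    Ok(paths)
}

/// Step 2: fetch metadata for each listed path. A concurrent deleter may have
/// removed a file since step 1, so NotFound is skipped instead of failing the scan.
fn stat_listed(paths: &[PathBuf]) -> io::Result<Vec<(PathBuf, u64)>> {
    let mut out = Vec::new();
    for path in paths {
        match fs::metadata(path) {
            Ok(meta) => out.push((path.clone(), meta.len())),
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

Skipping missing files only hides the symptom. It does not answer the question raised here of whether deletions should run at all before preload completes.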

skyzh · 2024-11-19T16:56:04Z

test_timeline_retain_lsn tests should pass now, will look into timeouts on emergency mode later; this patch is a little bit desperate to work on, context switching to other things for now :(


jcsp · 2024-11-20T18:37:25Z

pageserver/src/tenant/upload_queue.rs

+    /// The boolean value indicates whether the barrier is an initial barrier scheduled
+    /// at timeline load -- if yes, we will need to wait for all deletions to be completed
+    /// before the next upload.
+    Barrier(tokio::sync::watch::Sender<()>, bool),


Let's use a descriptive enum rather than a bool here, to make it easier to read & harder to typo
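
One way the enum could look, as a sketch under assumptions: `BarrierKind`, its variant names, and `must_wait_for_deletions` are hypothetical, and `std::sync::mpsc::Sender` stands in for `tokio::sync::watch::Sender` so the example needs no external crates.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical replacement for the bare `bool` in the barrier variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BarrierKind {
    /// Ordinary barrier: does not wait for the deletion queue.
    Normal,
    /// Scheduled at timeline load: deletions must complete before the next upload.
    Initial,
}

impl BarrierKind {
    fn waits_for_deletions(self) -> bool {
        matches!(self, BarrierKind::Initial)
    }
}

/// Simplified queue operation; the real type has more variants.
enum UploadOp {
    Barrier(Sender<()>, BarrierKind),
}

fn must_wait_for_deletions(op: &UploadOp) -> bool {
    match op {
        UploadOp::Barrier(_, kind) => kind.waits_for_deletions(),
    }
}
```

A call site then reads `UploadOp::Barrier(tx, BarrierKind::Initial)` rather than `UploadOp::Barrier(tx, true)`, so the intent is visible without consulting the doc comment.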

jcsp

This PR convinces me that:
A) this change improves safety 👍
B) we should later address the underlying complexity somehow, perhaps by modifying the ancestor detach to include a generation increment (the request flows through the storage controller so this is possible)


skyzh · 2024-11-20T21:55:58Z

2c4829c fixes the panic in test_detach_while_attaching

d710f00 adds a fastpath to delete executor flush so that it won't validate the deletion batch if there are no pending lists.
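
The fast path can be pictured roughly as follows. This is a toy model, not the actual deletion executor: `DeletionExecutor`, `pending_lists`, `validate`, and the counters are all hypothetical, and validation is reduced to a counter increment.

```rust
/// Toy model of a deletion executor that validates batches before deleting.
#[derive(Default)]
struct DeletionExecutor {
    pending_lists: Vec<Vec<String>>,
    validations: usize,
    deleted: Vec<String>,
}

impl DeletionExecutor {
    /// Stand-in for the (potentially slow) validation round trip.
    fn validate(&mut self) -> bool {
        self.validations += 1;
        true
    }

    fn flush(&mut self) {
        // Fast path: with nothing pending there is nothing to validate.
        if self.pending_lists.is_empty() {
            return;
        }
        if !self.validate() {
            return;
        }
        for list in self.pending_lists.drain(..) {
            self.deleted.extend(list);
        }
    }
}
```

In this model a flush on an empty executor returns without calling `validate` at all, which is the property the commit message describes.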


skyzh · 2024-11-21T18:34:24Z

d710f00 changed the emergency mode behavior not to wait for uploads


skyzh · 2024-11-21T20:56:29Z

The new patch #9844 is meant to fully address the problem in the remote client.

skyzh requested a review from a team as a code owner November 18, 2024 22:21

skyzh requested a review from yliang412 November 18, 2024 22:21

skyzh marked this pull request as draft November 18, 2024 22:25

skyzh removed the request for review from yliang412 November 18, 2024 22:26

skyzh force-pushed the skyzh/fix-barrier branch from e6e5b27 to aa2649c Compare November 18, 2024 22:44

skyzh requested review from jcsp and problame November 18, 2024 22:44

skyzh marked this pull request as ready for review November 18, 2024 22:44

arpad-m reviewed Nov 18, 2024


problame removed their request for review November 19, 2024 11:28

skyzh added 4 commits November 19, 2024 11:56

create the test case to reproduce the issue

45f6111

Signed-off-by: Alex Chi Z <[email protected]>

fix the issue

42ac6f6

Signed-off-by: Alex Chi Z <[email protected]>

fix local fs semantics due to race deletion/download

a9db766

Signed-off-by: Alex Chi Z <[email protected]>

fix type check

cdde254

Signed-off-by: Alex Chi Z <[email protected]>

skyzh force-pushed the skyzh/fix-barrier branch from ec9fcbd to cdde254 Compare November 19, 2024 16:58

jcsp reviewed Nov 20, 2024


jcsp approved these changes Nov 20, 2024


skyzh added 3 commits November 20, 2024 16:34

fix assertions

2c4829c

Signed-off-by: Alex Chi Z <[email protected]>

Merge branch 'main' of https://github.com/neondatabase/neon into skyz…

feaeba3

…h/fix-barrier

fix test_emergency_mode

d710f00

Signed-off-by: Alex Chi Z <[email protected]>

skyzh added 2 commits November 21, 2024 12:53

Merge branch 'main' of https://github.com/neondatabase/neon into skyz…

b501f1a

…h/fix-barrier

do not wait for checkpoint in emergency mode

95474cf

Signed-off-by: Alex Chi Z <[email protected]>

passthrough wait_for_upload, better upload scheduling

499105d

Signed-off-by: Alex Chi Z <[email protected]>

skyzh closed this Nov 21, 2024

jcsp mentioned this pull request Nov 22, 2024

fix(pageserver): ensure upload happens after delete #9844

Merged
