Correct mistakes in offloaded timeline retain_lsn management #9760

arpad-m · 2024-11-14T15:56:00Z

PR #9308 has modified tenant activation code to take offloaded child timelines into account for populating the list of retain_lsn values. However, there is more places than just tenant activation where one needs to update the retain_lsns.

This PR fixes some bugs of the current code that could lead to corruption in the worst case:

Deleting of an offloaded timeline would not get its retain_lsn purged from its parent. With the patch we now do it, but as the parent can be offloaded as well, the situatoin is a bit trickier than for non-offloaded timelines which can just keep a pointer to their parent. Here we can't keep a pointer because the parent might get offloaded, then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the tenant is not good either, because we might drop the offloaded timeline in a context where a offloaded_timelines lock is already held: so we don't want to acquire a lock in the drop code of OffloadedTimeline.
Unoffloading a timeline would not get its retain_lsn values populated, leading to it maybe garbage collecting values that its children might need. We now call initialize_gc_info on the parent.
Offloading of a timeline would not get its retain_lsn values registered as offloaded at the parent. So if we drop the Timeline object, and its registration is removed, the parent would not have any of the child's retain_lsns around. Also, before, the Timeline object would delete anything related to its timeline ID, now it only deletes retain_lsns that have MaybeOffloaded::No set.

Incorporates Chi's reproducer from #9753. cc https://github.com/neondatabase/cloud/issues/20199

The test_timeline_retain_lsn test is extended:

it gains a new dimension, duplicating each mode, to either have the "main" branch be the direct parent of the timeline we archive, or the "test_archived_parent" branch intermediary, creating a three timeline structure. This doesn't test anything fixed by this PR in particular, just explores the vast space of possible configurations a little bit more.
it gains two new modes, offload-parent, which tests the second point, and offload-no-restart which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the respective self.initialize_gc_info(), gc_info.insert_child() or ancestor_children.push().

Part of #8088

Signed-off-by: Alex Chi Z <[email protected]>

This doesn't actually reproduce any of the issues I've just fixed. Probably the timeline is still referenced somewhere and thus not dropped?

skyzh

LGTM; overall I feel we might want to revisit the retain_lsn mechanism, manipulating it in Drop doesn't seem like a good idea... As we get more and more timeline states, in the future, probably it would be a good idea to re-compute it every time before gc/compaction?

pageserver/src/tenant.rs

pageserver/src/tenant/timeline.rs

arpad-m · 2024-11-14T16:35:44Z

probably it would be a good idea to re-compute it every time before gc/compaction?

I actually thought that this was already done, due to #9308. Personally I'd be fine with that as well.

skyzh · 2024-11-14T16:46:10Z

probably it would be a good idea to re-compute it every time before gc/compaction?

I actually thought that this was already done, due to #9308. Personally I'd be fine with that as well.

But this is done in initialize_gc_info, which is only called in Tenant::activate; we also need to do that in refresh_gc_info for the future if it's a good idea. The current way of computing is incremental (i.e., only updated when new timelines get created / timelines get dropped), this is efficient but would cause larger engineering overhead to maintain...

arpad-m · 2024-11-14T16:49:15Z

But this is done in initialize_gc_info, which is only called in Tenant::activate;

yeah, that's what I wanted to say: i mistakenly assumed this when I wrote #9308. Now I know better :)

github-actions · 2024-11-14T18:09:10Z

5490 tests run: 5247 passed, 0 failed, 243 skipped (full report)

Flaky tests (1)

Postgres 17

test_cli_start_stop: release-arm64

Code coverage* (full report)

functions: 31.9% (7931 of 24849 functions)
lines: 49.7% (62841 of 126497 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
cef74a5 at 2024-11-15T12:59:31.645Z :recycle:}

Not sure why we can read the basebackup, but shrug.

…gement

…block"

arpad-m · 2024-11-15T13:25:16Z

Documenting the test failures if I comment out the offloading-specific retain_lsn code:

commented out self.initialize_gc_info:

FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - AssertionError: First log error on pageserver_1: (8984, '2024-11-15T13:08:29.472799Z ERROR page_service_conn_main{peer_addr=127.0.0.1:47642}:process_query{tenant_id=55706acf3cbb2e738d341d35f1d49699 timeline_id=8d82c84d21bf67cd8735c4e93c0651db}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - AssertionError: First log error on pageserver_1: (10088, "2024-11-15T13:08:28.585638Z ERROR page_service_conn_main{peer_addr=127.0.0.1:59498}: query handler for 'basebackup 29c9fe6e8eedde931f811b22d663bfe8 7810e79f0135f125419355b95ffb99db --gzip' ...

commented out gc_info.insert_child:

FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-True] - AssertionError: First log error on pageserver_1: (8075, '2024-11-15T13:12:01.817458Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46472}:process_query{tenant_id=1d4832e6363110d56c35508bdf3e872b timeline_id=953b2c11aee998a26c0fae6c112645aa}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-no-restart-False] - AssertionError: First log error on pageserver_1: (8146, "2024-11-15T13:12:03.551280Z ERROR page_service_conn_main{peer_addr=127.0.0.1:36762}: query handler for 'basebackup 0d9e975731b4a6e650a4a0d81871ef64 655e71391cf90d0ee14300c3ca2f78bb --gzip' f...

commented out ancestor_children.push:

FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-True] - psycopg2.errors.IoError: [NEON_SMGR] [shard 0] could not read block 1 in rel 1663/5/16385.0 from page server at lsn 0/015AEA60
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
FAILED test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - RuntimeError: Run ['/home/arpad/src/neon.main/target/debug/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-3'] failed:
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-True] - AssertionError: First log error on pageserver_1: (8412, '2024-11-15T13:19:58.864915Z ERROR page_service_conn_main{peer_addr=127.0.0.1:34142}:process_query{tenant_id=e1222ebe282467f72d560b2d1c3728c1 timeline_id=44086dc2ff71364001364b788cce9d10}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-True] - AssertionError: First log error on pageserver_1: (8989, '2024-11-15T13:19:59.441191Z ERROR page_service_conn_main{peer_addr=127.0.0.1:36232}:process_query{tenant_id=1b4eb5ccc71cfdba923d6a72f7d93d53 timeline_id=88c85775b855f3452b4072232b2a0ce3}:han...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-False] - AssertionError: First log error on pageserver_1: (8423, "2024-11-15T13:19:57.832008Z ERROR page_service_conn_main{peer_addr=127.0.0.1:53884}: query handler for 'basebackup d3dd85530cc06cf768f784b79b3295b5 e706a8c9aef648f0d6fec837abdd8f02 --gzip' f...
ERROR test_runner/regress/test_timeline_archive.py::test_timeline_retain_lsn[debug-pg17-offload-parent-False] - AssertionError: First log error on pageserver_1: (10104, "2024-11-15T13:19:59.341517Z ERROR page_service_conn_main{peer_addr=127.0.0.1:33498}: query handler for 'basebackup 767019d1809ca7794cdc2ed543df2bcf 7f3203f9a17daf030dca1878bb88c564 --gzip' ...

diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 909f99ea9..27de8c2ba 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -542,7 +542,7 @@ fn from_timeline(timeline: &Timeline) -> Result<Self, UploadQueueNotReadyError>
                 let ancestor_lsn = timeline.get_ancestor_lsn();
                 let ancestor_timeline_id = ancestor_timeline.timeline_id;
                 let mut gc_info = ancestor_timeline.gc_info.write().unwrap();
-                gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
+                //gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
                 (Some(ancestor_lsn), Some(ancestor_timeline_id))
             } else {
                 (None, None)
@@ -1988,7 +1988,7 @@ async fn unoffload_timeline(
                 None => warn!("timeline already removed from offloaded timelines"),
             }
 
-            self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
+            //self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
 
             Arc::clone(timeline)
         };
@@ -3865,7 +3865,7 @@ fn initialize_gc_info(
                     return;
                 };
                 let ancestor_children = all_branchpoints.entry(*ancestor_timeline_id).or_default();
-                ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
+                //ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
             });
 
         // The number of bytes we always keep, irrespective of PITR: this is a constant across timelines

…'s parent (#9791) There is a potential data corruption issue, not one I've encountered, but it's still not hard to hit with some correct looking code given our current architecture. It has to do with the timeline's memory object storage via reference counted `Arc`s, and the removal of `retain_lsn` entries at the drop of the last `Arc` reference. The corruption steps are as follows: 1. timeline gets offloaded. timeline object A doesn't get dropped though, because some long-running task accesses it 2. the same timeline gets unoffloaded again. timeline object B gets created for it, timeline object A still referenced. both point to the same timeline. 3. the task keeping the reference to timeline object A exits. destructor for object A runs, removing `retain_lsn` in the timeline's parent. 4. the timeline's parent runs gc without the `retain_lsn` of the still exant timleine's child, leading to data corruption. In general we are susceptible each time when we recreate a `Timeline` object in the same process, which happens both during a timeline offload/unoffload cycle, as well as during an ancestor detach operation. The solution this PR implements is to make the destructor for a timeline as well as an offloaded timeline remove at most one `retain_lsn`. PR #9760 has added a log line to print the refcounts at timeline offload, but this only detects one of the places where we do such a recycle operation. Plus it doesn't prevent the actual issue. I doubt that this occurs in practice. It is more a defense in depth measure. Usually I'd assume that the timeline gets dropped immediately in step 1, as there is no background tasks referencing it after its shutdown. But one never knows, and reducing the stakes of step 1 actually occurring is a really good idea, from potential data corruption to waste of CPU time. Part of #8088

skyzh and others added 6 commits November 13, 2024 17:07

test(pageserver): check retain_lsn after offload

528749b

Signed-off-by: Alex Chi Z <[email protected]>

Only remove not offloaded child registration in Timeline::drop

304bfef

Register offloaded timeline's gc_info in parent

25b07f4

Fix test and register offloaded children in timeline unoffloading

b75ad03

Add an offload-no-restart mode

7128920

This doesn't actually reproduce any of the issues I've just fixed. Probably the timeline is still referenced somewhere and thus not dropped?

Print the remaining refcount during timeline offloading

99a959a

arpad-m requested a review from skyzh November 14, 2024 15:56

arpad-m requested a review from a team as a code owner November 14, 2024 15:56

arpad-m mentioned this pull request Nov 14, 2024

Epic: Pageserver Timeline Archival #8088

Open

skyzh approved these changes Nov 14, 2024

View reviewed changes

pageserver/src/tenant.rs Outdated Show resolved Hide resolved

pageserver/src/tenant/timeline.rs Show resolved Hide resolved

arpad-m force-pushed the arpad/retain_lsn_management branch from 14e6cfb to 99a959a Compare November 14, 2024 16:52

Add offload-parent-no-restart mode

c67cd22

arpad-m added 8 commits November 14, 2024 23:18

Make the test reproduce the issues

e793e52

Make the test work with offload-corrupt and with_intermediary

9eb1200

Not sure why we can read the basebackup, but shrug.

Merge remote-tracking branch 'origin/main' into arpad/retain_lsn_mana…

7ed9d95

…gement

Remove the debug assertion to avoid double panics

2cea69d

Python style fix

7f14935

Extend possible error cases

414fa36

Make more generic, sometimes the error is about "relation size" not "…

8ef5370

…block"

We drop the timeline here, so we need to also defuse it

cef74a5

arpad-m merged commit 7880c24 into main Nov 15, 2024
80 checks passed

arpad-m deleted the arpad/retain_lsn_management branch November 15, 2024 13:22

arpad-m mentioned this pull request Nov 18, 2024

Remove at most one retain_lsn entry from (possibly offloaded) timelne's parent #9791

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correct mistakes in offloaded timeline retain_lsn management #9760

Correct mistakes in offloaded timeline retain_lsn management #9760

arpad-m commented Nov 14, 2024 •

edited

Loading

skyzh left a comment

arpad-m commented Nov 14, 2024

skyzh commented Nov 14, 2024

arpad-m commented Nov 14, 2024

github-actions bot commented Nov 14, 2024 •

edited

Loading

Postgres 17

arpad-m commented Nov 15, 2024

Correct mistakes in offloaded timeline retain_lsn management #9760

Correct mistakes in offloaded timeline retain_lsn management #9760

Conversation

arpad-m commented Nov 14, 2024 • edited Loading

skyzh left a comment

Choose a reason for hiding this comment

arpad-m commented Nov 14, 2024

skyzh commented Nov 14, 2024

arpad-m commented Nov 14, 2024

github-actions bot commented Nov 14, 2024 • edited Loading

5490 tests run: 5247 passed, 0 failed, 243 skipped (full report)

Postgres 17

Code coverage* (full report)

arpad-m commented Nov 15, 2024

arpad-m commented Nov 14, 2024 •

edited

Loading

github-actions bot commented Nov 14, 2024 •

edited

Loading