Also consider offloaded timelines for obtaining retain_lsn #9308

arpad-m · 2024-10-08T07:32:00Z

Also consider offloaded timelines for obtaining retain_lsn. This is required for correctness for all timelines that have not been flattened yet: otherwise we GC data that might still be required for reading.

This somewhat counteracts the original purpose of timeline offloading of not having to iterate over offloaded timelines, but sadly it's required. In the future, we can improve the way the offloaded timelines are stored.

We also make the retain_lsn optional so that in the future, when we implement flattening, we can make it None. This also applies to full timeline objects by the way, where it would probably make most sense to add a bool flag whether the timeline is successfully flattened, and if it is, one can exclude it from retain_lsn as well.

Also, track whether a timeline was offloaded or not in retain_lsn so that the retain_lsn can be excluded from visibility and size calculation.

Part of #8088

github-actions · 2024-10-08T07:33:31Z

5184 tests run: 4977 passed, 0 failed, 207 skipped (full report)

Flaky tests (2)

Postgres 16

test_replica_query_race: release-x86-64
test_peer_recovery: release-arm64

Code coverage* (full report)

functions: 31.4% (7548 of 24029 functions)
lines: 49.2% (60364 of 122755 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
16c3bdb at 2024-10-14T14:53:55.427Z :recycle:}

VladLazar

Looks good, but let's give the questions below some thought.
Also, could you please update the cover letter to explain why the current code is not correct?

Also consider offloaded timelines for obtaining retain_lsn. This somewhat counteracts the original purpose of timeline offloading of not having to iterate over offloaded timelines, but sadly it's required for correctness.

Looking at retain_lsns usage I have a question. Do we actually need all lsns or just the lower bound? Not strictly related to your PR, but can solve the pain point you mention above.

Do we want branch points of offloaded branches to be visible? With this change we include them in the visibility calculation, but it doesn't feel right.

pageserver/src/tenant.rs

arpad-m · 2024-10-14T10:47:48Z

Do we want branch points of offloaded branches to be visible? With this change we include them in the visibility calculation, but it doesn't feel right.

At some point we want to unoffload and flatten the timeline, when the layers become visible again. But while it is offloaded, we don't want to keep the layers around.

As this is a different situation from gc, where the layers need to be kept, I now add a bool param to distinguish the two cases.

pageserver/src/tenant/timeline.rs

PR #9308 has modified tenant activation code to take offloaded child timelines into account for populating the list of `retain_lsn` values. However, there is more places than just tenant activation where one needs to update the `retain_lsn`s. This PR fixes some bugs of the current code that could lead to corruption in the worst case: 1. Deleting of an offloaded timeline would not get its `retain_lsn` purged from its parent. With the patch we now do it, but as the parent can be offloaded as well, the situatoin is a bit trickier than for non-offloaded timelines which can just keep a pointer to their parent. Here we can't keep a pointer because the parent might get offloaded, then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the *tenant* is not good either, because we might drop the offloaded timeline in a context where a `offloaded_timelines` lock is already held: so we don't want to acquire a lock in the drop code of OffloadedTimeline. 2. Unoffloading a timeline would not get its `retain_lsn` values populated, leading to it maybe garbage collecting values that its children might need. We now call `initialize_gc_info` on the parent. 3. Offloading of a timeline would not get its `retain_lsn` values registered as offloaded at the parent. So if we drop the `Timeline` object, and its registration is removed, the parent would not have any of the child's `retain_lsn`s around. Also, before, the `Timeline` object would delete anything related to its timeline ID, now it only deletes `retain_lsn`s that have `MaybeOffloaded::No` set. Incorporates Chi's reproducer from #9753. cc neondatabase/cloud#20199 The `test_timeline_retain_lsn` test is extended: 1. it gains a new dimension, duplicating each mode, to either have the "main" branch be the direct parent of the timeline we archive, or the "test_archived_parent" branch intermediary, creating a three timeline structure. This doesn't test anything fixed by this PR in particular, just explores the vast space of possible configurations a little bit more. 2. it gains two new modes, `offload-parent`, which tests the second point, and `offload-no-restart` which tests the third point. It's easy to verify the test actually is "sharp" by removing one of the respective `self.initialize_gc_info()`, `gc_info.insert_child()` or `ancestor_children.push()`. Part of #8088 --------- Signed-off-by: Alex Chi Z <[email protected]> Co-authored-by: Alex Chi Z <[email protected]>

arpad-m requested a review from a team as a code owner October 8, 2024 07:32

arpad-m requested review from skyzh and removed request for a team October 8, 2024 07:32

arpad-m mentioned this pull request Oct 8, 2024

Epic: Pageserver Timeline Archival #8088

Open

Base automatically changed from arpad/timeline_offload to main October 8, 2024 23:33

Also consider offloaded timelines for obtaining retain_lsn

3a3b0b0

arpad-m force-pushed the arpad/offloaded_retain_lsn branch from d583919 to 3a3b0b0 Compare October 8, 2024 23:35

VladLazar reviewed Oct 9, 2024

View reviewed changes

pageserver/src/tenant.rs Show resolved Hide resolved

pageserver/src/tenant.rs Show resolved Hide resolved

arpad-m added 2 commits October 14, 2024 11:25

Add is_offloaded flag

3a0f4f1

tests

17b9c12

VladLazar approved these changes Oct 14, 2024

View reviewed changes

pageserver/src/tenant/timeline.rs Outdated Show resolved Hide resolved

arpad-m added 2 commits October 14, 2024 14:33

enum

4d5aaa0

Clippy and fmt

16c3bdb

arpad-m merged commit f54e3e9 into main Oct 14, 2024
79 checks passed

arpad-m deleted the arpad/offloaded_retain_lsn branch October 14, 2024 15:54

arpad-m mentioned this pull request Nov 14, 2024

Correct mistakes in offloaded timeline retain_lsn management #9760

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Also consider offloaded timelines for obtaining retain_lsn #9308

Also consider offloaded timelines for obtaining retain_lsn #9308

arpad-m commented Oct 8, 2024 •

edited

Loading

github-actions bot commented Oct 8, 2024 •

edited

Loading

Postgres 16

VladLazar left a comment •

edited

Loading

arpad-m commented Oct 14, 2024

Also consider offloaded timelines for obtaining retain_lsn #9308

Also consider offloaded timelines for obtaining retain_lsn #9308

Conversation

arpad-m commented Oct 8, 2024 • edited Loading

github-actions bot commented Oct 8, 2024 • edited Loading

5184 tests run: 4977 passed, 0 failed, 207 skipped (full report)

Postgres 16

Code coverage* (full report)

VladLazar left a comment • edited Loading

Choose a reason for hiding this comment

arpad-m commented Oct 14, 2024

arpad-m commented Oct 8, 2024 •

edited

Loading

github-actions bot commented Oct 8, 2024 •

edited

Loading

VladLazar left a comment •

edited

Loading