Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Also consider offloaded timelines for obtaining retain_lsn #9308

Merged
merged 5 commits into from
Oct 14, 2024

Conversation

arpad-m
Copy link
Member

@arpad-m arpad-m commented Oct 8, 2024

Also consider offloaded timelines for obtaining retain_lsn. This is required for correctness for all timelines that have not been flattened yet: otherwise we GC data that might still be required for reading.

This somewhat counteracts the original purpose of timeline offloading of not having to iterate over offloaded timelines, but sadly it's required. In the future, we can improve the way the offloaded timelines are stored.

We also make the retain_lsn optional so that in the future, when we implement flattening, we can make it None. This also applies to full timeline objects by the way, where it would probably make most sense to add a bool flag whether the timeline is successfully flattened, and if it is, one can exclude it from retain_lsn as well.

Also, track whether a timeline was offloaded or not in retain_lsn so that the retain_lsn can be excluded from visibility and size calculation.

Part of #8088

@arpad-m arpad-m requested a review from a team as a code owner October 8, 2024 07:32
@arpad-m arpad-m requested review from skyzh and removed request for a team October 8, 2024 07:32
Copy link

github-actions bot commented Oct 8, 2024

5184 tests run: 4977 passed, 0 failed, 207 skipped (full report)


Flaky tests (2)

Postgres 16

Code coverage* (full report)

  • functions: 31.4% (7548 of 24029 functions)
  • lines: 49.2% (60364 of 122755 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
16c3bdb at 2024-10-14T14:53:55.427Z :recycle:

Base automatically changed from arpad/timeline_offload to main October 8, 2024 23:33
@arpad-m arpad-m force-pushed the arpad/offloaded_retain_lsn branch from d583919 to 3a3b0b0 Compare October 8, 2024 23:35
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but let's give the questions below some thought.
Also, could you please update the cover letter to explain why the current code is not correct?


Also consider offloaded timelines for obtaining retain_lsn. This somewhat counteracts the original purpose of timeline offloading of not having to iterate over offloaded timelines, but sadly it's required for correctness.

Looking at retain_lsns usage I have a question. Do we actually need all lsns or just the lower bound? Not strictly related to your PR, but can solve the pain point you mention above.


Do we want branch points of offloaded branches to be visible? With this change we include them in the visibility calculation, but it doesn't feel right.

pageserver/src/tenant.rs Show resolved Hide resolved
pageserver/src/tenant.rs Show resolved Hide resolved
@arpad-m
Copy link
Member Author

arpad-m commented Oct 14, 2024

Do we want branch points of offloaded branches to be visible? With this change we include them in the visibility calculation, but it doesn't feel right.

At some point we want to unoffload and flatten the timeline, when the layers become visible again. But while it is offloaded, we don't want to keep the layers around.

As this is a different situation from gc, where the layers need to be kept, I now add a bool param to distinguish the two cases.

pageserver/src/tenant/timeline.rs Outdated Show resolved Hide resolved
@arpad-m arpad-m merged commit f54e3e9 into main Oct 14, 2024
79 checks passed
@arpad-m arpad-m deleted the arpad/offloaded_retain_lsn branch October 14, 2024 15:54
arpad-m added a commit that referenced this pull request Nov 15, 2024
PR #9308 has modified tenant activation code to take offloaded child
timelines into account for populating the list of `retain_lsn` values.
However, there is more places than just tenant activation where one
needs to update the `retain_lsn`s.

This PR fixes some bugs of the current code that could lead to
corruption in the worst case:

1. Deleting of an offloaded timeline would not get its `retain_lsn`
purged from its parent. With the patch we now do it, but as the parent
can be offloaded as well, the situatoin is a bit trickier than for
non-offloaded timelines which can just keep a pointer to their parent.
Here we can't keep a pointer because the parent might get offloaded,
then unoffloaded again, creating a dangling pointer situation. Keeping a
pointer to the *tenant* is not good either, because we might drop the
offloaded timeline in a context where a `offloaded_timelines` lock is
already held: so we don't want to acquire a lock in the drop code of
OffloadedTimeline.
2. Unoffloading a timeline would not get its `retain_lsn` values
populated, leading to it maybe garbage collecting values that its
children might need. We now call `initialize_gc_info` on the parent.
3. Offloading of a timeline would not get its `retain_lsn` values
registered as offloaded at the parent. So if we drop the `Timeline`
object, and its registration is removed, the parent would not have any
of the child's `retain_lsn`s around. Also, before, the `Timeline` object
would delete anything related to its timeline ID, now it only deletes
`retain_lsn`s that have `MaybeOffloaded::No` set.

Incorporates Chi's reproducer from #9753. cc
neondatabase/cloud#20199

The `test_timeline_retain_lsn` test is extended:

1. it gains a new dimension, duplicating each mode, to either have the
"main" branch be the direct parent of the timeline we archive, or the
"test_archived_parent" branch intermediary, creating a three timeline
structure. This doesn't test anything fixed by this PR in particular,
just explores the vast space of possible configurations a little bit
more.
2. it gains two new modes, `offload-parent`, which tests the second
point, and `offload-no-restart` which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the
respective `self.initialize_gc_info()`, `gc_info.insert_child()` or
`ancestor_children.push()`.

Part of #8088

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Alex Chi Z <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants