-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Also consider offloaded timelines for obtaining retain_lsn #9308
Conversation
5184 tests run: 4977 passed, 0 failed, 207 skipped (full report)Flaky tests (2)Postgres 16
Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
16c3bdb at 2024-10-14T14:53:55.427Z :recycle: |
d583919
to
3a3b0b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but let's give the questions below some thought.
Also, could you please update the cover letter to explain why the current code is not correct?
Also consider offloaded timelines for obtaining retain_lsn. This somewhat counteracts the original purpose of timeline offloading of not having to iterate over offloaded timelines, but sadly it's required for correctness.
Looking at retain_lsns
usage I have a question. Do we actually need all lsns or just the lower bound? Not strictly related to your PR, but can solve the pain point you mention above.
Do we want branch points of offloaded branches to be visible? With this change we include them in the visibility calculation, but it doesn't feel right.
At some point we want to unoffload and flatten the timeline, when the layers become visible again. But while it is offloaded, we don't want to keep the layers around. As this is a different situation from gc, where the layers need to be kept, I now add a bool param to distinguish the two cases. |
PR #9308 has modified tenant activation code to take offloaded child timelines into account for populating the list of `retain_lsn` values. However, there is more places than just tenant activation where one needs to update the `retain_lsn`s. This PR fixes some bugs of the current code that could lead to corruption in the worst case: 1. Deleting of an offloaded timeline would not get its `retain_lsn` purged from its parent. With the patch we now do it, but as the parent can be offloaded as well, the situatoin is a bit trickier than for non-offloaded timelines which can just keep a pointer to their parent. Here we can't keep a pointer because the parent might get offloaded, then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the *tenant* is not good either, because we might drop the offloaded timeline in a context where a `offloaded_timelines` lock is already held: so we don't want to acquire a lock in the drop code of OffloadedTimeline. 2. Unoffloading a timeline would not get its `retain_lsn` values populated, leading to it maybe garbage collecting values that its children might need. We now call `initialize_gc_info` on the parent. 3. Offloading of a timeline would not get its `retain_lsn` values registered as offloaded at the parent. So if we drop the `Timeline` object, and its registration is removed, the parent would not have any of the child's `retain_lsn`s around. Also, before, the `Timeline` object would delete anything related to its timeline ID, now it only deletes `retain_lsn`s that have `MaybeOffloaded::No` set. Incorporates Chi's reproducer from #9753. cc neondatabase/cloud#20199 The `test_timeline_retain_lsn` test is extended: 1. it gains a new dimension, duplicating each mode, to either have the "main" branch be the direct parent of the timeline we archive, or the "test_archived_parent" branch intermediary, creating a three timeline structure. This doesn't test anything fixed by this PR in particular, just explores the vast space of possible configurations a little bit more. 2. it gains two new modes, `offload-parent`, which tests the second point, and `offload-no-restart` which tests the third point. It's easy to verify the test actually is "sharp" by removing one of the respective `self.initialize_gc_info()`, `gc_info.insert_child()` or `ancestor_children.push()`. Part of #8088 --------- Signed-off-by: Alex Chi Z <[email protected]> Co-authored-by: Alex Chi Z <[email protected]>
Also consider offloaded timelines for obtaining
retain_lsn
. This is required for correctness for all timelines that have not been flattened yet: otherwise we GC data that might still be required for reading.This somewhat counteracts the original purpose of timeline offloading of not having to iterate over offloaded timelines, but sadly it's required. In the future, we can improve the way the offloaded timelines are stored.
We also make the
retain_lsn
optional so that in the future, when we implement flattening, we can make it None. This also applies to full timeline objects by the way, where it would probably make most sense to add a bool flag whether the timeline is successfully flattened, and if it is, one can exclude it fromretain_lsn
as well.Also, track whether a timeline was offloaded or not in
retain_lsn
so that theretain_lsn
can be excluded from visibility and size calculation.Part of #8088