Correct mistakes in offloaded timeline retain_lsn management #9760
Signed-off-by: Alex Chi Z <[email protected]>
This doesn't actually reproduce any of the issues I've just fixed. Probably the timeline is still referenced somewhere and thus not dropped?
LGTM; overall I feel we might want to revisit the `retain_lsn` mechanism: manipulating it in `Drop` doesn't seem like a good idea... As we get more and more timeline states, it would probably be a good idea in the future to re-compute it every time before gc/compaction?
I actually thought that this was already done, due to #9308. Personally I'd be fine with that as well.
But this is done in |
yeah, that's what I wanted to say: I mistakenly assumed this when I wrote #9308. Now I know better :)
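The idea floated in this thread, recomputing the `retain_lsn` entries from the current set of timelines instead of mutating them in destructors, could look roughly like the sketch below. It is only an illustration: `TimelineMeta`, `Branchpoints`, and `recompute_branchpoints` are hypothetical names, and `Lsn`/`TimelineId` are plain integers here rather than the pageserver's types. Only the `(retain_lsn, timeline_id, MaybeOffloaded)` tuple shape is taken from the diff in this PR.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for illustration; not the pageserver's real types.
type Lsn = u64;
type TimelineId = u32;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MaybeOffloaded {
    Yes,
    No,
}

// Hypothetical minimal view of a timeline: its id and, if it is a branch,
// the ancestor id together with the branch point LSN.
struct TimelineMeta {
    id: TimelineId,
    ancestor: Option<(TimelineId, Lsn)>,
}

// Ancestor id -> branch points that gc on that ancestor must keep readable.
type Branchpoints = BTreeMap<TimelineId, Vec<(Lsn, TimelineId, MaybeOffloaded)>>;

// Rebuild the whole map from scratch, so no destructor has to keep it in sync.
fn recompute_branchpoints(loaded: &[TimelineMeta], offloaded: &[TimelineMeta]) -> Branchpoints {
    let mut all = Branchpoints::new();
    let tagged = loaded
        .iter()
        .map(|t| (t, MaybeOffloaded::No))
        .chain(offloaded.iter().map(|t| (t, MaybeOffloaded::Yes)));
    for (timeline, tag) in tagged {
        if let Some((ancestor_id, branch_lsn)) = timeline.ancestor {
            all.entry(ancestor_id)
                .or_default()
                .push((branch_lsn, timeline.id, tag));
        }
    }
    all
}
```

Calling such a function right before each gc or compaction run would trade a little recomputation for not having to reason about which object's destructor ran when.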
14e6cfb to 99a959a
5490 tests run: 5247 passed, 0 failed, 243 skipped
Code coverage collected from Rust tests only.
cef74a5 at 2024-11-15T12:59:31.645Z
Not sure why we can read the basebackup, but shrug.
Documenting the test failures if I comment out the offloading-specific code (diff below):
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 909f99ea9..27de8c2ba 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -542,7 +542,7 @@ fn from_timeline(timeline: &Timeline) -> Result<Self, UploadQueueNotReadyError>
let ancestor_lsn = timeline.get_ancestor_lsn();
let ancestor_timeline_id = ancestor_timeline.timeline_id;
let mut gc_info = ancestor_timeline.gc_info.write().unwrap();
- gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
+ //gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
(Some(ancestor_lsn), Some(ancestor_timeline_id))
} else {
(None, None)
@@ -1988,7 +1988,7 @@ async fn unoffload_timeline(
None => warn!("timeline already removed from offloaded timelines"),
}
- self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
+ //self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));
Arc::clone(timeline)
};
@@ -3865,7 +3865,7 @@ fn initialize_gc_info(
return;
};
let ancestor_children = all_branchpoints.entry(*ancestor_timeline_id).or_default();
- ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
+ //ancestor_children.push((retain_lsn, *timeline_id, MaybeOffloaded::Yes));
});
// The number of bytes we always keep, irrespective of PITR: this is a constant across timelines
…'s parent (#9791)

There is a potential data corruption issue, not one I've encountered, but it's still not hard to hit with some correct looking code given our current architecture. It has to do with the timeline's memory object storage via reference counted `Arc`s, and the removal of `retain_lsn` entries at the drop of the last `Arc` reference.

The corruption steps are as follows:

1. The timeline gets offloaded. Timeline object A doesn't get dropped though, because some long-running task accesses it.
2. The same timeline gets unoffloaded again. Timeline object B gets created for it while timeline object A is still referenced. Both point to the same timeline.
3. The task keeping the reference to timeline object A exits. The destructor for object A runs, removing `retain_lsn` in the timeline's parent.
4. The timeline's parent runs gc without the `retain_lsn` of the still extant child timeline, leading to data corruption.

In general we are susceptible each time we recreate a `Timeline` object in the same process, which happens both during a timeline offload/unoffload cycle and during an ancestor detach operation.

The solution this PR implements is to make the destructor for a timeline, as well as for an offloaded timeline, remove at most one `retain_lsn`.

PR #9760 has added a log line to print the refcounts at timeline offload, but this only detects one of the places where we do such a recycle operation. Plus it doesn't prevent the actual issue.

I doubt that this occurs in practice. It is more a defense in depth measure. Usually I'd assume that the timeline gets dropped immediately in step 1, as there are no background tasks referencing it after its shutdown. But one never knows, and reducing the stakes of step 1 actually occurring, from potential data corruption to waste of CPU time, is a really good idea.

Part of #8088
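The four steps can be replayed in a small, self-contained sketch. Everything in it is a simplified assumption rather than pageserver code: `ParentGcInfo`, `ChildTimeline`, `remove_all`, `remove_one`, and `remaining_after_recycle` are hypothetical names, and the parent's bookkeeping is reduced to a plain vector of `(lsn, child id)` pairs. The flag switches between a destructor that removes every entry for the child and one that removes at most one entry, which is the behaviour the commit message describes.

```rust
use std::sync::{Arc, Mutex};

// Simplified stand-ins for illustration; not the pageserver's real types.
type Lsn = u64;
type TimelineId = u32;

#[derive(Default)]
struct ParentGcInfo {
    retain_lsns: Vec<(Lsn, TimelineId)>,
}

impl ParentGcInfo {
    // Removes every entry registered for `child`.
    fn remove_all(&mut self, child: TimelineId) {
        self.retain_lsns.retain(|(_, id)| *id != child);
    }

    // Removes at most one entry registered for `child`.
    fn remove_one(&mut self, child: TimelineId) {
        if let Some(pos) = self.retain_lsns.iter().position(|(_, id)| *id == child) {
            self.retain_lsns.remove(pos);
        }
    }
}

struct ChildTimeline {
    id: TimelineId,
    parent: Arc<Mutex<ParentGcInfo>>,
    remove_at_most_one: bool,
}

impl ChildTimeline {
    // Creating the in-memory object registers one retain_lsn at the parent.
    fn new(
        id: TimelineId,
        branch_lsn: Lsn,
        parent: Arc<Mutex<ParentGcInfo>>,
        remove_at_most_one: bool,
    ) -> Arc<Self> {
        parent.lock().unwrap().retain_lsns.push((branch_lsn, id));
        Arc::new(ChildTimeline { id, parent, remove_at_most_one })
    }
}

impl Drop for ChildTimeline {
    fn drop(&mut self) {
        let mut info = self.parent.lock().unwrap();
        if self.remove_at_most_one {
            info.remove_one(self.id);
        } else {
            info.remove_all(self.id);
        }
    }
}

// Replays steps 1 to 3 and returns how many retain_lsn entries the parent
// still holds for the child when gc (step 4) would run.
fn remaining_after_recycle(remove_at_most_one: bool) -> usize {
    let parent = Arc::new(Mutex::new(ParentGcInfo::default()));
    let a = ChildTimeline::new(7, 100, Arc::clone(&parent), remove_at_most_one);
    let task_ref = Arc::clone(&a); // a long-running task keeps object A alive
    drop(a); // step 1: offload, but A is not destroyed yet
    // Step 2: unoffload creates object B for the same timeline.
    let _b = ChildTimeline::new(7, 100, Arc::clone(&parent), remove_at_most_one);
    drop(task_ref); // step 3: the task exits and A's destructor runs
    let remaining = parent.lock().unwrap().retain_lsns.len();
    remaining
}
```

In this sketch, `remaining_after_recycle(false)` leaves the parent with no entry for the still-live object B, while `remaining_after_recycle(true)` leaves exactly one.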
PR #9308 has modified tenant activation code to take offloaded child timelines into account for populating the list of `retain_lsn` values. However, there are more places than just tenant activation where one needs to update the `retain_lsn`s.

This PR fixes some bugs of the current code that could lead to corruption in the worst case:

1. Dropping an offloaded timeline would not get its `retain_lsn` purged from its parent. With the patch we now do it, but as the parent can be offloaded as well, the situation is a bit trickier than for non-offloaded timelines, which can just keep a pointer to their parent. Here we can't keep a pointer because the parent might get offloaded, then unoffloaded again, creating a dangling pointer situation. Keeping a pointer to the tenant is not good either, because we might drop the offloaded timeline in a context where an `offloaded_timelines` lock is already held, so we don't want to acquire a lock in the drop code of `OffloadedTimeline`.
2. Unoffloading a timeline would not get its `retain_lsn` values populated, leading to it maybe garbage collecting values that its children might need. We now call `initialize_gc_info` on the parent.
3. Offloading a timeline would not get its `retain_lsn` values registered as offloaded at the parent. So if we drop the `Timeline` object and its registration is removed, the parent would not have any of the child's `retain_lsn`s around. Also, before, the `Timeline` object would delete anything related to its timeline ID; now it only deletes `retain_lsn`s that have `MaybeOffloaded::No` set.

Incorporates Chi's reproducer from #9753. cc https://github.com/neondatabase/cloud/issues/20199
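To make the role of the `MaybeOffloaded` tag concrete, here is a minimal sketch of parent-side bookkeeping in which each entry carries the tag and removal only touches entries with a matching tag. It is an assumption-laden illustration: `GcInfo` here is a stripped-down stand-in, `remove_child` and `min_retain_lsn` are hypothetical helpers, and `Lsn`/`TimelineId` are plain integers. Only `insert_child` with its `(timeline_id, lsn, MaybeOffloaded)` arguments mirrors a call visible in the diff.

```rust
// Simplified stand-ins for illustration; not the pageserver's real types.
type Lsn = u64;
type TimelineId = u32;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MaybeOffloaded {
    Yes,
    No,
}

#[derive(Default)]
struct GcInfo {
    // One entry per child branch point that gc must keep readable.
    retain_lsns: Vec<(Lsn, TimelineId, MaybeOffloaded)>,
}

impl GcInfo {
    fn insert_child(&mut self, child: TimelineId, lsn: Lsn, offloaded: MaybeOffloaded) {
        self.retain_lsns.push((lsn, child, offloaded));
    }

    // Hypothetical helper: removes only the entries for `child` that carry
    // the given tag, leaving entries with the other tag in place.
    fn remove_child(&mut self, child: TimelineId, offloaded: MaybeOffloaded) {
        self.retain_lsns
            .retain(|(_, id, tag)| !(*id == child && *tag == offloaded));
    }

    // Hypothetical helper: the lowest LSN some child still depends on.
    fn min_retain_lsn(&self) -> Option<Lsn> {
        self.retain_lsns.iter().map(|(lsn, _, _)| *lsn).min()
    }
}

// Walks one child through load -> offload -> delete and reports what the
// parent still retains after the Timeline object is dropped and after the
// offloaded timeline is removed.
fn demo() -> (Option<Lsn>, Option<Lsn>) {
    let mut parent = GcInfo::default();
    parent.insert_child(7, 100, MaybeOffloaded::No); // child is loaded
    parent.insert_child(7, 100, MaybeOffloaded::Yes); // offload registers itself
    parent.remove_child(7, MaybeOffloaded::No); // Timeline object is dropped
    let after_timeline_drop = parent.min_retain_lsn();
    parent.remove_child(7, MaybeOffloaded::Yes); // offloaded timeline goes away
    let after_offloaded_removed = parent.min_retain_lsn();
    (after_timeline_drop, after_offloaded_removed)
}
```

Because the `Timeline` drop in this sketch only removes `MaybeOffloaded::No` entries, the parent keeps protecting LSN 100 for as long as the offloaded registration exists.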
The `test_timeline_retain_lsn` test is extended with `offload-parent`, which tests the second point, and `offload-no-restart`, which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the respective `self.initialize_gc_info()`, `gc_info.insert_child()` or `ancestor_children.push()` calls.

Part of #8088