Correct mistakes in offloaded timeline retain_lsn management (#9760)
PR #9308 modified the tenant activation code to take offloaded child
timelines into account when populating the list of `retain_lsn` values.
However, there are more places than just tenant activation where the
`retain_lsn`s need to be updated.

This PR fixes several bugs in the current code that, in the worst case,
could lead to corruption:

1. Deleting an offloaded timeline would not purge its `retain_lsn` from
its parent. With this patch we now do, but because the parent can be
offloaded as well, the situation is a bit trickier than for
non-offloaded timelines, which can simply keep a pointer to their parent.
Here we can't keep a pointer, because the parent might get offloaded and
then unoffloaded again, leaving a dangling pointer. Keeping a pointer to
the *tenant* is not good either, because we might drop the offloaded
timeline in a context where an `offloaded_timelines` lock is already
held: so we don't want to acquire a lock in the drop code of
`OffloadedTimeline` (see the sketch after this list).
2. Unoffloading a timeline would not populate its `retain_lsn` values,
so it could garbage collect data that its children still need. We now
call `initialize_gc_info` on the parent.
3. Offloading a timeline would not register its `retain_lsn` values as
offloaded at the parent. So once we drop the `Timeline` object and its
registration is removed, the parent would not have any of the child's
`retain_lsn`s around. Also, before, the `Timeline` object would delete
anything related to its timeline ID; now it only deletes `retain_lsn`s
that have `MaybeOffloaded::No` set.
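
To illustrate the shape of the fix, here is a minimal, self-contained sketch of the two mechanisms (simplified stand-in types such as `OffloadedChild`, and `eprintln!` in place of `tracing::warn!`; this is not the actual pageserver code): `retain_lsn` entries in the parent's `GcInfo` are tagged with `MaybeOffloaded`, and the offloaded timeline carries a `deleted_from_ancestor` flag that explicit cleanup sets and that `Drop` only checks, so no lock is ever taken in the drop path.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

type TimelineId = u64;
type Lsn = u64;

#[derive(Copy, Clone, PartialEq, Eq, Debug)]
enum MaybeOffloaded {
    Yes,
    No,
}

/// Stand-in for the parent's GcInfo: retain_lsn entries carry an offloaded tag.
#[derive(Default)]
struct GcInfo {
    retain_lsns: Vec<(Lsn, TimelineId, MaybeOffloaded)>,
}

impl GcInfo {
    fn insert_child(&mut self, child: TimelineId, lsn: Lsn, kind: MaybeOffloaded) {
        self.retain_lsns.push((lsn, child, kind));
        self.retain_lsns.sort_by_key(|i| i.0);
    }
    fn remove_child_maybe_offloaded(&mut self, child: TimelineId, kind: MaybeOffloaded) {
        // Only drop the entry whose tag matches, so dropping a live `Timeline`
        // cannot take the offloaded registration with it.
        self.retain_lsns.retain(|i| !(i.1 == child && i.2 == kind));
    }
}

/// Stand-in for `OffloadedTimeline`: no pointer back to parent or tenant, so the
/// ancestor cleanup is an explicit call and `Drop` merely checks that it happened.
struct OffloadedChild {
    timeline_id: TimelineId,
    deleted_from_ancestor: AtomicBool,
}

impl OffloadedChild {
    /// Explicit cleanup, done while the caller already holds the locks it needs.
    fn delete_from_ancestor(&self, ancestor_gc_info: &Mutex<GcInfo>) {
        ancestor_gc_info
            .lock()
            .unwrap()
            .remove_child_maybe_offloaded(self.timeline_id, MaybeOffloaded::Yes);
        self.deleted_from_ancestor.store(true, Ordering::Release);
    }
}

impl Drop for OffloadedChild {
    fn drop(&mut self) {
        // No lock acquisition here: just warn if the explicit cleanup was skipped.
        if !self.deleted_from_ancestor.load(Ordering::Acquire) {
            eprintln!(
                "offloaded timeline {} dropped without ancestor cleanup",
                self.timeline_id
            );
        }
    }
}

fn main() {
    let ancestor_gc_info = Mutex::new(GcInfo::default());

    // Offloading child 42: the parent now holds both a live (`No`) and an
    // offloaded (`Yes`) retain_lsn entry for the same branch point.
    {
        let mut gc_info = ancestor_gc_info.lock().unwrap();
        gc_info.insert_child(42, 0x20, MaybeOffloaded::No);
        gc_info.insert_child(42, 0x20, MaybeOffloaded::Yes);
        // Dropping the live `Timeline` removes only its own entry (point 3).
        gc_info.remove_child_maybe_offloaded(42, MaybeOffloaded::No);
        assert_eq!(gc_info.retain_lsns, vec![(0x20, 42, MaybeOffloaded::Yes)]);
    }

    // Deleting the offloaded child deregisters it explicitly; `Drop` stays quiet (point 1).
    let child = OffloadedChild {
        timeline_id: 42,
        deleted_from_ancestor: AtomicBool::new(false),
    };
    child.delete_from_ancestor(&ancestor_gc_info);
    drop(child);
    assert!(ancestor_gc_info.lock().unwrap().retain_lsns.is_empty());
}
```

The real code additionally has a `defuse_for_tenant_drop()` escape hatch for tenant shutdown and tenant-wide deletion, where deregistering from the ancestor would be pointless.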

Incorporates Chi's reproducer from #9753. cc
neondatabase/cloud#20199

The `test_timeline_retain_lsn` test is extended:

1. it gains a new dimension, duplicating each mode, so that either the
"main" branch is the direct parent of the timeline we archive, or the
"test_archived_parent" branch sits in between, creating a three-timeline
structure. This doesn't test anything fixed by this PR in particular; it
just explores the vast space of possible configurations a little bit
more.
2. it gains two new modes: `offload-parent`, which tests the second
point, and `offload-no-restart`, which tests the third point.

It's easy to verify that the test is actually "sharp" by removing one of
the respective `self.initialize_gc_info()`, `gc_info.insert_child()`, or
`ancestor_children.push()` calls.

Part of #8088

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Alex Chi Z <[email protected]>
arpad-m and skyzh authored Nov 15, 2024
1 parent 04938d9 commit 7880c24
Showing 5 changed files with 244 additions and 47 deletions.
152 changes: 138 additions & 14 deletions pageserver/src/tenant.rs
@@ -39,6 +39,7 @@ use remote_timeline_client::UploadQueueNotReadyError;
use std::collections::BTreeMap;
use std::fmt;
use std::future::Future;
use std::sync::atomic::AtomicBool;
use std::sync::Weak;
use std::time::SystemTime;
use storage_broker::BrokerClientChannel;
@@ -524,6 +525,9 @@ pub struct OffloadedTimeline {
/// Prevent two tasks from deleting the timeline at the same time. If held, the
/// timeline is being deleted. If 'true', the timeline has already been deleted.
pub delete_progress: TimelineDeleteProgress,

/// Part of the `OffloadedTimeline` object's lifecycle: this needs to be set before we drop it
pub deleted_from_ancestor: AtomicBool,
}

impl OffloadedTimeline {
@@ -533,24 +537,34 @@ impl OffloadedTimeline {
/// the timeline is not in a stopped state.
/// Panics if the timeline is not archived.
fn from_timeline(timeline: &Timeline) -> Result<Self, UploadQueueNotReadyError> {
let ancestor_retain_lsn = timeline
.get_ancestor_timeline_id()
.map(|_timeline_id| timeline.get_ancestor_lsn());
let (ancestor_retain_lsn, ancestor_timeline_id) =
if let Some(ancestor_timeline) = timeline.ancestor_timeline() {
let ancestor_lsn = timeline.get_ancestor_lsn();
let ancestor_timeline_id = ancestor_timeline.timeline_id;
let mut gc_info = ancestor_timeline.gc_info.write().unwrap();
gc_info.insert_child(timeline.timeline_id, ancestor_lsn, MaybeOffloaded::Yes);
(Some(ancestor_lsn), Some(ancestor_timeline_id))
} else {
(None, None)
};
let archived_at = timeline
.remote_client
.archived_at_stopped_queue()?
.expect("must be called on an archived timeline");
Ok(Self {
tenant_shard_id: timeline.tenant_shard_id,
timeline_id: timeline.timeline_id,
ancestor_timeline_id: timeline.get_ancestor_timeline_id(),
ancestor_timeline_id,
ancestor_retain_lsn,
archived_at,

delete_progress: timeline.delete_progress.clone(),
deleted_from_ancestor: AtomicBool::new(false),
})
}
fn from_manifest(tenant_shard_id: TenantShardId, manifest: &OffloadedTimelineManifest) -> Self {
// We expect to reach this case in tenant loading, where the `retain_lsn` is populated in the parent's `gc_info`
// by the `initialize_gc_info` function.
let OffloadedTimelineManifest {
timeline_id,
ancestor_timeline_id,
@@ -564,6 +578,7 @@ impl OffloadedTimeline {
ancestor_retain_lsn,
archived_at,
delete_progress: TimelineDeleteProgress::default(),
deleted_from_ancestor: AtomicBool::new(false),
}
}
fn manifest(&self) -> OffloadedTimelineManifest {
@@ -581,6 +596,33 @@ impl OffloadedTimeline {
archived_at: *archived_at,
}
}
/// Delete this timeline's retain_lsn from its ancestor, if present in the given tenant
fn delete_from_ancestor_with_timelines(
&self,
timelines: &std::sync::MutexGuard<'_, HashMap<TimelineId, Arc<Timeline>>>,
) {
if let (Some(_retain_lsn), Some(ancestor_timeline_id)) =
(self.ancestor_retain_lsn, self.ancestor_timeline_id)
{
if let Some((_, ancestor_timeline)) = timelines
.iter()
.find(|(tid, _tl)| **tid == ancestor_timeline_id)
{
ancestor_timeline
.gc_info
.write()
.unwrap()
.remove_child_offloaded(self.timeline_id);
}
}
self.deleted_from_ancestor.store(true, Ordering::Release);
}
/// Call [`Self::delete_from_ancestor_with_timelines`] instead if possible.
///
/// As the entire tenant is being dropped, don't bother deregistering the `retain_lsn` from the ancestor.
fn defuse_for_tenant_drop(&self) {
self.deleted_from_ancestor.store(true, Ordering::Release);
}
}

impl fmt::Debug for OffloadedTimeline {
@@ -589,6 +631,17 @@ impl fmt::Debug for OffloadedTimeline {
}
}

impl Drop for OffloadedTimeline {
fn drop(&mut self) {
if !self.deleted_from_ancestor.load(Ordering::Acquire) {
tracing::warn!(
"offloaded timeline {} was dropped without having cleaned it up at the ancestor",
self.timeline_id
);
}
}
}

#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug)]
pub enum MaybeOffloaded {
Yes,
@@ -1531,14 +1584,15 @@ impl Tenant {
}
// Complete deletions for offloaded timeline id's.
offloaded_timelines_list
.retain(|(offloaded_id, _offloaded)| {
.retain(|(offloaded_id, offloaded)| {
// At this point, offloaded_timeline_ids has the list of all offloaded timelines
// without a prefix in S3, so they are inexistent.
// In the end, existence of a timeline is finally determined by the existence of an index-part.json in remote storage.
// If there is a dangling reference in another location, they need to be cleaned up.
let delete = offloaded_timeline_ids.contains(offloaded_id);
if delete {
tracing::info!("Removing offloaded timeline {offloaded_id} from manifest as no remote prefix was found");
offloaded.defuse_for_tenant_drop();
}
!delete
});
@@ -1927,9 +1981,15 @@ impl Tenant {
)));
};
let mut offloaded_timelines = self.timelines_offloaded.lock().unwrap();
if offloaded_timelines.remove(&timeline_id).is_none() {
warn!("timeline already removed from offloaded timelines");
match offloaded_timelines.remove(&timeline_id) {
Some(offloaded) => {
offloaded.delete_from_ancestor_with_timelines(&timelines);
}
None => warn!("timeline already removed from offloaded timelines"),
}

self.initialize_gc_info(&timelines, &offloaded_timelines, Some(timeline_id));

Arc::clone(timeline)
};

@@ -2667,7 +2727,7 @@ impl Tenant {
.filter(|timeline| !(timeline.is_broken() || timeline.is_stopping()));

// Before activation, populate each Timeline's GcInfo with information about its children
self.initialize_gc_info(&timelines_accessor, &timelines_offloaded_accessor);
self.initialize_gc_info(&timelines_accessor, &timelines_offloaded_accessor, None);

// Spawn gc and compaction loops. The loops will shut themselves
// down when they notice that the tenant is inactive.
@@ -2782,8 +2842,14 @@ impl Tenant {
let timeline_id = timeline.timeline_id;
let span = tracing::info_span!("timeline_shutdown", %timeline_id, ?shutdown_mode);
js.spawn(async move { timeline.shutdown(shutdown_mode).instrument(span).await });
})
};
});
}
{
let timelines_offloaded = self.timelines_offloaded.lock().unwrap();
timelines_offloaded.values().for_each(|timeline| {
timeline.defuse_for_tenant_drop();
});
}
// test_long_timeline_create_then_tenant_delete is leaning on this message
tracing::info!("Waiting for timelines...");
while let Some(res) = js.join_next().await {
@@ -3767,10 +3833,13 @@ impl Tenant {
&self,
timelines: &std::sync::MutexGuard<HashMap<TimelineId, Arc<Timeline>>>,
timelines_offloaded: &std::sync::MutexGuard<HashMap<TimelineId, Arc<OffloadedTimeline>>>,
restrict_to_timeline: Option<TimelineId>,
) {
// This function must be called before activation: after activation timeline create/delete operations
// might happen, and this function is not safe to run concurrently with those.
assert!(!self.is_active());
if restrict_to_timeline.is_none() {
// This function must be called before activation: after activation timeline create/delete operations
// might happen, and this function is not safe to run concurrently with those.
assert!(!self.is_active());
}

// Scan all timelines. For each timeline, remember the timeline ID and
// the branch point where it was created.
@@ -3803,7 +3872,12 @@ impl Tenant {
let horizon = self.get_gc_horizon();

// Populate each timeline's GcInfo with information about its child branches
for timeline in timelines.values() {
let timelines_to_write = if let Some(timeline_id) = restrict_to_timeline {
itertools::Either::Left(timelines.get(&timeline_id).into_iter())
} else {
itertools::Either::Right(timelines.values())
};
for timeline in timelines_to_write {
let mut branchpoints: Vec<(Lsn, TimelineId, MaybeOffloaded)> = all_branchpoints
.remove(&timeline.timeline_id)
.unwrap_or_default();
@@ -9650,4 +9724,54 @@ mod tests {

Ok(())
}

#[cfg(feature = "testing")]
#[tokio::test]
async fn test_timeline_offload_retain_lsn() -> anyhow::Result<()> {
let harness = TenantHarness::create("test_timeline_offload_retain_lsn")
.await
.unwrap();
let (tenant, ctx) = harness.load().await;
let tline_parent = tenant
.create_test_timeline(TIMELINE_ID, Lsn(0x10), DEFAULT_PG_VERSION, &ctx)
.await
.unwrap();
let tline_child = tenant
.branch_timeline_test(&tline_parent, NEW_TIMELINE_ID, Some(Lsn(0x20)), &ctx)
.await
.unwrap();
{
let gc_info_parent = tline_parent.gc_info.read().unwrap();
assert_eq!(
gc_info_parent.retain_lsns,
vec![(Lsn(0x20), tline_child.timeline_id, MaybeOffloaded::No)]
);
}
// We have to directly call the remote_client instead of using the archive function to avoid constructing broker client...
tline_child
.remote_client
.schedule_index_upload_for_timeline_archival_state(TimelineArchivalState::Archived)
.unwrap();
tline_child.remote_client.wait_completion().await.unwrap();
offload_timeline(&tenant, &tline_child)
.instrument(tracing::info_span!(parent: None, "offload_test", tenant_id=%"test", shard_id=%"test", timeline_id=%"test"))
.await.unwrap();
let child_timeline_id = tline_child.timeline_id;
Arc::try_unwrap(tline_child).unwrap();

{
let gc_info_parent = tline_parent.gc_info.read().unwrap();
assert_eq!(
gc_info_parent.retain_lsns,
vec![(Lsn(0x20), child_timeline_id, MaybeOffloaded::Yes)]
);
}

tenant
.get_offloaded_timeline(child_timeline_id)
.unwrap()
.defuse_for_tenant_drop();

Ok(())
}
}
31 changes: 22 additions & 9 deletions pageserver/src/tenant/timeline.rs
@@ -477,8 +477,21 @@ impl GcInfo {
self.retain_lsns.sort_by_key(|i| i.0);
}

pub(super) fn remove_child(&mut self, child_id: TimelineId) {
self.retain_lsns.retain(|i| i.1 != child_id);
pub(super) fn remove_child_maybe_offloaded(
&mut self,
child_id: TimelineId,
maybe_offloaded: MaybeOffloaded,
) {
self.retain_lsns
.retain(|i| !(i.1 == child_id && i.2 == maybe_offloaded));
}

pub(super) fn remove_child_not_offloaded(&mut self, child_id: TimelineId) {
self.remove_child_maybe_offloaded(child_id, MaybeOffloaded::No);
}

pub(super) fn remove_child_offloaded(&mut self, child_id: TimelineId) {
self.remove_child_maybe_offloaded(child_id, MaybeOffloaded::Yes);
}
}

@@ -4501,7 +4514,7 @@ impl Drop for Timeline {
// This lock should never be poisoned, but in case it is we do a .map() instead of
// an unwrap(), to avoid panicking in a destructor and thereby aborting the process.
if let Ok(mut gc_info) = ancestor.gc_info.write() {
gc_info.remove_child(self.timeline_id)
gc_info.remove_child_not_offloaded(self.timeline_id)
}
}
}
@@ -5030,7 +5043,7 @@ impl Timeline {

// 1. Is it newer than GC horizon cutoff point?
if l.get_lsn_range().end > space_cutoff {
debug!(
info!(
"keeping {} because it's newer than space_cutoff {}",
l.layer_name(),
space_cutoff,
@@ -5041,7 +5054,7 @@ impl Timeline {

// 2. It is newer than PiTR cutoff point?
if l.get_lsn_range().end > time_cutoff {
debug!(
info!(
"keeping {} because it's newer than time_cutoff {}",
l.layer_name(),
time_cutoff,
@@ -5060,7 +5073,7 @@ impl Timeline {
for retain_lsn in &retain_lsns {
// start_lsn is inclusive
if &l.get_lsn_range().start <= retain_lsn {
debug!(
info!(
"keeping {} because it's still might be referenced by child branch forked at {} is_dropped: xx is_incremental: {}",
l.layer_name(),
retain_lsn,
@@ -5075,7 +5088,7 @@ impl Timeline {
if let Some(lsn) = &max_lsn_with_valid_lease {
// keep if layer start <= any of the lease
if &l.get_lsn_range().start <= lsn {
debug!(
info!(
"keeping {} because there is a valid lease preventing GC at {}",
l.layer_name(),
lsn,
@@ -5107,13 +5120,13 @@ impl Timeline {
if !layers
.image_layer_exists(&l.get_key_range(), &(l.get_lsn_range().end..new_gc_cutoff))
{
debug!("keeping {} because it is the latest layer", l.layer_name());
info!("keeping {} because it is the latest layer", l.layer_name());
result.layers_not_updated += 1;
continue 'outer;
}

// We didn't find any reason to keep this file, so remove it.
debug!(
info!(
"garbage collecting {} is_dropped: xx is_incremental: {}",
l.layer_name(),
l.is_incremental(),
3 changes: 2 additions & 1 deletion pageserver/src/tenant/timeline/delete.rs
@@ -141,9 +141,10 @@ async fn remove_maybe_offloaded_timeline_from_tenant(
);
}
TimelineOrOffloaded::Offloaded(timeline) => {
timelines_offloaded
let offloaded_timeline = timelines_offloaded
.remove(&timeline.timeline_id)
.expect("timeline that we were deleting was concurrently removed from 'timelines_offloaded' map");
offloaded_timeline.delete_from_ancestor_with_timelines(&timelines);
}
}


1 comment on commit 7880c24

@github-actions

5581 tests run: 5326 passed, 2 failed, 253 skipped (full report)


Failures on Postgres 16

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_sharded_ingest[release-pg16-github-actions-selfhosted-1] or test_compaction_l0_memory[release-pg16-github-actions-selfhosted]"
Flaky tests (2)

Postgres 17

Code coverage* (full report)

  • functions: 31.9% (7929 of 24849 functions)
  • lines: 49.7% (62844 of 126497 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
7880c24 at 2024-11-15T15:11:07.778Z :recycle:
