Skip to content

Commit

Permalink
fix(pageserver): abort on duplicate layers, before doing damage (#7799)
Browse files Browse the repository at this point in the history
fixes #7790 (duplicating most
of the issue description here for posterity)

# Background

From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
https://github.com/neondatabase/neon/blob/a8e6d259cb49d1bf156dfc2215b92c04d1e8a08f/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md?plain=1#L41-L50

As of #5198 , we should not be exposed to that problem anymore.

# Problem 1

We still have
1. [code in
Pageserver](https://github.com/neondatabase/neon/blob/82960b2175211c0f666b91b5258c5e2253a245c7/pageserver/src/tenant/timeline.rs#L4502-L4521)
than handles duplicate layers
2. [tests in the test
suite](https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L15)
that demonstrates the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What is does instead is to return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.

That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.

# Problem 2

Further, if production code _did_ (it nowawdays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:

* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old with the new
layer file
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
  * PS PageCache block cache: caches old bit battern
* blob_io offsets stored in variables, based on pre-overwrite bit
pattern / offsets
* => reading based on these offsets from the new file might yield
different data than before
 
# Solution

- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent over-rename the file during
`.finish()`, bail with an error if it happens (Problem 2)
- This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
- Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.

# Future Work

This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.

The failure in step 5 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.

The mitigation for this is to `/reset` the tenant.
  • Loading branch information
problame authored Jun 4, 2024
1 parent fd22fc5 commit 17116f2
Show file tree
Hide file tree
Showing 11 changed files with 252 additions and 132 deletions.
3 changes: 3 additions & 0 deletions libs/utils/src/fs_ext.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,9 @@ use std::{fs, io, path::Path};

use anyhow::Context;

mod rename_noreplace;
pub use rename_noreplace::rename_noreplace;

pub trait PathExt {
/// Returns an error if `self` is not a directory.
fn is_empty_dir(&self) -> io::Result<bool>;
Expand Down
109 changes: 109 additions & 0 deletions libs/utils/src/fs_ext/rename_noreplace.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
use nix::NixPath;

/// Rename a file without replacing an existing file.
///
/// This is a wrapper around platform-specific APIs.
pub fn rename_noreplace<P1: ?Sized + NixPath, P2: ?Sized + NixPath>(
src: &P1,
dst: &P2,
) -> nix::Result<()> {
{
#[cfg(target_os = "linux")]
{
nix::fcntl::renameat2(
None,
src,
None,
dst,
nix::fcntl::RenameFlags::RENAME_NOREPLACE,
)
}
#[cfg(target_os = "macos")]
{
let res = src.with_nix_path(|src| {
dst.with_nix_path(|dst|
// SAFETY: `src` and `dst` are valid C strings as per the NixPath trait and they outlive the call to renamex_np.
unsafe {
nix::libc::renamex_np(src.as_ptr(), dst.as_ptr(), nix::libc::RENAME_EXCL)
})
})??;
nix::errno::Errno::result(res).map(drop)
}
#[cfg(not(any(target_os = "linux", target_os = "macos")))]
{
std::compile_error!("OS does not support no-replace renames");
}
}
}

#[cfg(test)]
mod test {
use std::{fs, path::PathBuf};

use super::*;

fn testdir() -> camino_tempfile::Utf8TempDir {
match crate::env::var("NEON_UTILS_RENAME_NOREPLACE_TESTDIR") {
Some(path) => {
let path: camino::Utf8PathBuf = path;
camino_tempfile::tempdir_in(path).unwrap()
}
None => camino_tempfile::tempdir().unwrap(),
}
}

#[test]
fn test_absolute_paths() {
let testdir = testdir();
println!("testdir: {}", testdir.path());

let src = testdir.path().join("src");
let dst = testdir.path().join("dst");

fs::write(&src, b"").unwrap();
fs::write(&dst, b"").unwrap();

let src = src.canonicalize().unwrap();
assert!(src.is_absolute());
let dst = dst.canonicalize().unwrap();
assert!(dst.is_absolute());

let result = rename_noreplace(&src, &dst);
assert_eq!(result.unwrap_err(), nix::Error::EEXIST);
}

#[test]
fn test_relative_paths() {
let testdir = testdir();
println!("testdir: {}", testdir.path());

// this is fine because we run in nextest => process per test
std::env::set_current_dir(testdir.path()).unwrap();

let src = PathBuf::from("src");
let dst = PathBuf::from("dst");

fs::write(&src, b"").unwrap();
fs::write(&dst, b"").unwrap();

let result = rename_noreplace(&src, &dst);
assert_eq!(result.unwrap_err(), nix::Error::EEXIST);
}

#[test]
fn test_works_when_not_exists() {
let testdir = testdir();
println!("testdir: {}", testdir.path());

let src = testdir.path().join("src");
let dst = testdir.path().join("dst");

fs::write(&src, b"content").unwrap();

rename_noreplace(src.as_std_path(), dst.as_std_path()).unwrap();
assert_eq!(
"content",
String::from_utf8(std::fs::read(&dst).unwrap()).unwrap()
);
}
}
35 changes: 28 additions & 7 deletions pageserver/src/tenant.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3865,6 +3865,9 @@ pub(crate) mod harness {
pub fn create_custom(
test_name: &'static str,
tenant_conf: TenantConf,
tenant_id: TenantId,
shard_identity: ShardIdentity,
generation: Generation,
) -> anyhow::Result<Self> {
setup_logging();

Expand All @@ -3877,8 +3880,12 @@ pub(crate) mod harness {
// OK in a test.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));

let tenant_id = TenantId::generate();
let tenant_shard_id = TenantShardId::unsharded(tenant_id);
let shard = shard_identity.shard_index();
let tenant_shard_id = TenantShardId {
tenant_id,
shard_number: shard.shard_number,
shard_count: shard.shard_count,
};
fs::create_dir_all(conf.tenant_path(&tenant_shard_id))?;
fs::create_dir_all(conf.timelines_path(&tenant_shard_id))?;

Expand All @@ -3896,8 +3903,8 @@ pub(crate) mod harness {
conf,
tenant_conf,
tenant_shard_id,
generation: Generation::new(0xdeadbeef),
shard: ShardIndex::unsharded(),
generation,
shard,
remote_storage,
remote_fs_dir,
deletion_queue,
Expand All @@ -3912,8 +3919,15 @@ pub(crate) mod harness {
compaction_period: Duration::ZERO,
..TenantConf::default()
};

Self::create_custom(test_name, tenant_conf)
let tenant_id = TenantId::generate();
let shard = ShardIdentity::unsharded();
Self::create_custom(
test_name,
tenant_conf,
tenant_id,
shard,
Generation::new(0xdeadbeef),
)
}

pub fn span(&self) -> tracing::Span {
Expand Down Expand Up @@ -4037,6 +4051,7 @@ mod tests {
use tests::storage_layer::ValuesReconstructState;
use tests::timeline::{GetVectoredError, ShutdownMode};
use utils::bin_ser::BeSer;
use utils::id::TenantId;

static TEST_KEY: Lazy<Key> =
Lazy::new(|| Key::from_slice(&hex!("010000000033333333444444445500000001")));
Expand Down Expand Up @@ -4936,7 +4951,13 @@ mod tests {
..TenantConf::default()
};

let harness = TenantHarness::create_custom("test_get_vectored_key_gap", tenant_conf)?;
let harness = TenantHarness::create_custom(
"test_get_vectored_key_gap",
tenant_conf,
TenantId::generate(),
ShardIdentity::unsharded(),
Generation::new(0xdeadbeef),
)?;
let (tenant, ctx) = harness.load().await;

let mut current_key = Key::from_hex("010000000033333333444444445500000000").unwrap();
Expand Down
35 changes: 22 additions & 13 deletions pageserver/src/tenant/storage_layer/delta_layer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -478,6 +478,23 @@ impl DeltaLayerWriterInner {
key_end: Key,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let temp_path = self.path.clone();
let result = self.finish0(key_end, timeline, ctx).await;
if result.is_err() {
tracing::info!(%temp_path, "cleaning up temporary file after error during writing");
if let Err(e) = std::fs::remove_file(&temp_path) {
tracing::warn!(error=%e, %temp_path, "error cleaning up temporary layer file after error during writing");
}
}
result
}

async fn finish0(
self,
key_end: Key,
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let index_start_blk =
((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
Expand Down Expand Up @@ -651,19 +668,11 @@ impl DeltaLayerWriter {
timeline: &Arc<Timeline>,
ctx: &RequestContext,
) -> anyhow::Result<ResidentLayer> {
let inner = self.inner.take().unwrap();
let temp_path = inner.path.clone();
let result = inner.finish(key_end, timeline, ctx).await;
// The delta layer files can sometimes be really large. Clean them up.
if result.is_err() {
tracing::warn!(
"Cleaning up temporary delta file {temp_path} after error during writing"
);
if let Err(e) = std::fs::remove_file(&temp_path) {
tracing::warn!("Error cleaning up temporary delta layer file {temp_path}: {e:?}")
}
}
result
self.inner
.take()
.unwrap()
.finish(key_end, timeline, ctx)
.await
}
}

Expand Down
92 changes: 76 additions & 16 deletions pageserver/src/tenant/storage_layer/image_layer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -917,26 +917,57 @@ impl Drop for ImageLayerWriter {

#[cfg(test)]
mod test {
use std::time::Duration;

use bytes::Bytes;
use pageserver_api::{
key::Key,
shard::{ShardCount, ShardIdentity, ShardNumber, ShardStripeSize},
};
use utils::{id::TimelineId, lsn::Lsn};
use utils::{
generation::Generation,
id::{TenantId, TimelineId},
lsn::Lsn,
};

use crate::{tenant::harness::TenantHarness, DEFAULT_PG_VERSION};
use crate::{
tenant::{config::TenantConf, harness::TenantHarness},
DEFAULT_PG_VERSION,
};

use super::ImageLayerWriter;

#[tokio::test]
async fn image_layer_rewrite() {
let harness = TenantHarness::create("test_image_layer_rewrite").unwrap();
let (tenant, ctx) = harness.load().await;

let tenant_conf = TenantConf {
gc_period: Duration::ZERO,
compaction_period: Duration::ZERO,
..TenantConf::default()
};
let tenant_id = TenantId::generate();
let mut gen = Generation::new(0xdead0001);
let mut get_next_gen = || {
let ret = gen;
gen = gen.next();
ret
};
// The LSN at which we will create an image layer to filter
let lsn = Lsn(0xdeadbeef0000);

let timeline_id = TimelineId::generate();

//
// Create an unsharded parent with a layer.
//

let harness = TenantHarness::create_custom(
"test_image_layer_rewrite--parent",
tenant_conf.clone(),
tenant_id,
ShardIdentity::unsharded(),
get_next_gen(),
)
.unwrap();
let (tenant, ctx) = harness.load().await;
let timeline = tenant
.create_test_timeline(timeline_id, lsn, DEFAULT_PG_VERSION, &ctx)
.await
Expand Down Expand Up @@ -971,9 +1002,47 @@ mod test {
};
let original_size = resident.metadata().file_size;

//
// Create child shards and do the rewrite, exercising filter().
// TODO: abstraction in TenantHarness for splits.
//

// Filter for various shards: this exercises cases like values at start of key range, end of key
// range, middle of key range.
for shard_number in 0..4 {
let shard_count = ShardCount::new(4);
for shard_number in 0..shard_count.count() {
//
// mimic the shard split
//
let shard_identity = ShardIdentity::new(
ShardNumber(shard_number),
shard_count,
ShardStripeSize(0x8000),
)
.unwrap();
let harness = TenantHarness::create_custom(
Box::leak(Box::new(format!(
"test_image_layer_rewrite--child{}",
shard_identity.shard_slug()
))),
tenant_conf.clone(),
tenant_id,
shard_identity,
// NB: in reality, the shards would each fork off their own gen number sequence from the parent.
// But here, all we care about is that the gen number is unique.
get_next_gen(),
)
.unwrap();
let (tenant, ctx) = harness.load().await;
let timeline = tenant
.create_test_timeline(timeline_id, lsn, DEFAULT_PG_VERSION, &ctx)
.await
.unwrap();

//
// use filter() and make assertions
//

let mut filtered_writer = ImageLayerWriter::new(
harness.conf,
timeline_id,
Expand All @@ -985,15 +1054,6 @@ mod test {
.await
.unwrap();

// TenantHarness gave us an unsharded tenant, but we'll use a sharded ShardIdentity
// to exercise filter()
let shard_identity = ShardIdentity::new(
ShardNumber(shard_number),
ShardCount::new(4),
ShardStripeSize(0x8000),
)
.unwrap();

let wrote_keys = resident
.filter(&shard_identity, &mut filtered_writer, &ctx)
.await
Expand Down
7 changes: 4 additions & 3 deletions pageserver/src/tenant/storage_layer/layer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -277,9 +277,10 @@ impl Layer {

let downloaded = resident.expect("just initialized");

// if the rename works, the path is as expected
// TODO: sync system call
std::fs::rename(temp_path, owner.local_path())
// We never want to overwrite an existing file, so we use `RENAME_NOREPLACE`.
// TODO: this leaves the temp file in place if the rename fails, risking us running
// out of space. Should we clean it up here or does the calling context deal with this?
utils::fs_ext::rename_noreplace(temp_path.as_std_path(), owner.local_path().as_std_path())
.with_context(|| format!("rename temporary file as correct path for {owner}"))?;

Ok(ResidentLayer { downloaded, owner })
Expand Down
Loading

1 comment on commit 17116f2

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3268 tests run: 3115 passed, 1 failed, 152 skipped (full report)


Failures on Postgres 14

  • test_basebackup_with_high_slru_count[github-actions-selfhosted-vectored-10-13-30]: release
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_basebackup_with_high_slru_count[release-pg14-github-actions-selfhosted-vectored-10-13-30]"
Flaky tests (1)

Postgres 15

  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 31.5% (6585 of 20893 functions)
  • lines: 48.5% (50987 of 105136 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
17116f2 at 2024-06-04T17:38:47.952Z :recycle:

Please sign in to comment.