-
Notifications
You must be signed in to change notification settings - Fork 463
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix(pageserver): abort on duplicate layers, before doing damage (#7799)
fixes #7790 (duplicating most of the issue description here for posterity) # Background From the time before always-authoritative `index_part.json`, we had to handle duplicate layers. See the RFC for an illustration of how duplicate layers could happen: https://github.com/neondatabase/neon/blob/a8e6d259cb49d1bf156dfc2215b92c04d1e8a08f/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md?plain=1#L41-L50 As of #5198 , we should not be exposed to that problem anymore. # Problem 1 We still have 1. [code in Pageserver](https://github.com/neondatabase/neon/blob/82960b2175211c0f666b91b5258c5e2253a245c7/pageserver/src/tenant/timeline.rs#L4502-L4521) than handles duplicate layers 2. [tests in the test suite](https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L15) that demonstrates the problem using a failpoint However, the test in the test suite doesn't use the failpoint to induce a crash that could legitimately happen in production. What is does instead is to return early with an `Ok()`, so that the code in Pageserver that handles duplicate layers (item 1) actually gets exercised. That "return early" would be a bug in the routine if it happened in production. So, the tests in the test suite are tests for their own sake, but don't serve to actually regress-test any production behavior. # Problem 2 Further, if production code _did_ (it nowawdays doesn't!) create a duplicate layer, the code in Pageserver that handles the condition (item 1 above) is too little and too late: * the code handles it by discarding the newer `struct Layer`; that's good. * however, on disk, we have already overwritten the old with the new layer file * the fact that we do it atomically doesn't matter because ... * if the new layer file is not bit-identical, then we have a cache coherency problem * PS PageCache block cache: caches old bit battern * blob_io offsets stored in variables, based on pre-overwrite bit pattern / offsets * => reading based on these offsets from the new file might yield different data than before # Solution - Remove the test suite code pertaining to Problem 1 - Move & rename test suite code that actually tests RFC-27 crash-consistent layer map. - Remove the Pageserver code that handles duplicate layers too late (Problem 1) - Use `RENAME_NOREPLACE` to prevent over-rename the file during `.finish()`, bail with an error if it happens (Problem 2) - This bailing prevents the caller from even trying to insert into the layer map, as they don't even get a `struct Layer` at hand. - Add `abort`s in the place where we have the layer map lock and check for duplicates (Problem 2) - Note again, we can't reach there because we bail from `.finish()` much earlier in the code. - Share the logic to clean up after failed `.finish()` between image layers and delta layers (drive-by cleanup) - This exposed that test `image_layer_rewrite` was overwriting layer files in place. Fix the test. # Future Work This PR adds a new failure scenario that was previously "papered over" by the overwriting of layers: 1. Start a compaction that will produce 3 layers: A, B, C 2. Layer A is `finish()`ed successfully. 3. Layer B fails mid-way at some `put_value()`. 4. Compaction bails out, sleeps 20s. 5. Some disk space gets freed in the meantime. 6. Compaction wakes from sleep, another iteration starts, it attempts to write Layer A again. But the `.finish()` **fails because A already exists on disk**. The failure in step 5 is new with this PR, and it **causes the compaction to get stuck**. Before, it would silently overwrite the file and "successfully" complete the second iteration. The mitigation for this is to `/reset` the tenant.
- Loading branch information
Showing
11 changed files
with
252 additions
and
132 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
use nix::NixPath; | ||
|
||
/// Rename a file without replacing an existing file. | ||
/// | ||
/// This is a wrapper around platform-specific APIs. | ||
pub fn rename_noreplace<P1: ?Sized + NixPath, P2: ?Sized + NixPath>( | ||
src: &P1, | ||
dst: &P2, | ||
) -> nix::Result<()> { | ||
{ | ||
#[cfg(target_os = "linux")] | ||
{ | ||
nix::fcntl::renameat2( | ||
None, | ||
src, | ||
None, | ||
dst, | ||
nix::fcntl::RenameFlags::RENAME_NOREPLACE, | ||
) | ||
} | ||
#[cfg(target_os = "macos")] | ||
{ | ||
let res = src.with_nix_path(|src| { | ||
dst.with_nix_path(|dst| | ||
// SAFETY: `src` and `dst` are valid C strings as per the NixPath trait and they outlive the call to renamex_np. | ||
unsafe { | ||
nix::libc::renamex_np(src.as_ptr(), dst.as_ptr(), nix::libc::RENAME_EXCL) | ||
}) | ||
})??; | ||
nix::errno::Errno::result(res).map(drop) | ||
} | ||
#[cfg(not(any(target_os = "linux", target_os = "macos")))] | ||
{ | ||
std::compile_error!("OS does not support no-replace renames"); | ||
} | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod test { | ||
use std::{fs, path::PathBuf}; | ||
|
||
use super::*; | ||
|
||
fn testdir() -> camino_tempfile::Utf8TempDir { | ||
match crate::env::var("NEON_UTILS_RENAME_NOREPLACE_TESTDIR") { | ||
Some(path) => { | ||
let path: camino::Utf8PathBuf = path; | ||
camino_tempfile::tempdir_in(path).unwrap() | ||
} | ||
None => camino_tempfile::tempdir().unwrap(), | ||
} | ||
} | ||
|
||
#[test] | ||
fn test_absolute_paths() { | ||
let testdir = testdir(); | ||
println!("testdir: {}", testdir.path()); | ||
|
||
let src = testdir.path().join("src"); | ||
let dst = testdir.path().join("dst"); | ||
|
||
fs::write(&src, b"").unwrap(); | ||
fs::write(&dst, b"").unwrap(); | ||
|
||
let src = src.canonicalize().unwrap(); | ||
assert!(src.is_absolute()); | ||
let dst = dst.canonicalize().unwrap(); | ||
assert!(dst.is_absolute()); | ||
|
||
let result = rename_noreplace(&src, &dst); | ||
assert_eq!(result.unwrap_err(), nix::Error::EEXIST); | ||
} | ||
|
||
#[test] | ||
fn test_relative_paths() { | ||
let testdir = testdir(); | ||
println!("testdir: {}", testdir.path()); | ||
|
||
// this is fine because we run in nextest => process per test | ||
std::env::set_current_dir(testdir.path()).unwrap(); | ||
|
||
let src = PathBuf::from("src"); | ||
let dst = PathBuf::from("dst"); | ||
|
||
fs::write(&src, b"").unwrap(); | ||
fs::write(&dst, b"").unwrap(); | ||
|
||
let result = rename_noreplace(&src, &dst); | ||
assert_eq!(result.unwrap_err(), nix::Error::EEXIST); | ||
} | ||
|
||
#[test] | ||
fn test_works_when_not_exists() { | ||
let testdir = testdir(); | ||
println!("testdir: {}", testdir.path()); | ||
|
||
let src = testdir.path().join("src"); | ||
let dst = testdir.path().join("dst"); | ||
|
||
fs::write(&src, b"content").unwrap(); | ||
|
||
rename_noreplace(src.as_std_path(), dst.as_std_path()).unwrap(); | ||
assert_eq!( | ||
"content", | ||
String::from_utf8(std::fs::read(&dst).unwrap()).unwrap() | ||
); | ||
} | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
17116f2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3268 tests run: 3115 passed, 1 failed, 152 skipped (full report)
Failures on Postgres 14
test_basebackup_with_high_slru_count[github-actions-selfhosted-vectored-10-13-30]
: releaseFlaky tests (1)
Postgres 15
test_vm_bit_clear_on_heap_lock
: debugCode coverage* (full report)
functions
:31.5% (6585 of 20893 functions)
lines
:48.5% (50987 of 105136 lines)
* collected from Rust tests only
17116f2 at 2024-06-04T17:38:47.952Z :recycle: