Skip to content

Commit

Permalink
layer file creation: fatal_err on timeline dir fsync (#6985)
Browse files Browse the repository at this point in the history
As pointed out in the comments added in this PR:
the in-memory state of the filesystem already has the layer file in its
final place.
If the fsync fails, but pageserver continues to execute, it's quite easy
for subsequent pageserver code to observe the file being there and
assume it's durable, when it really isn't.

It can happen that we get ENOSPC during the fsync.
However,
1. the timeline dir is small (remember, the big layer _file_ has already
been synced).
Small data means ENOSPC due to delayed allocation races etc are less
likely.
2. what else are we going to do in that case?

If we decide to bubble up the error, the file remains on disk.
We could try to unlink it and fsync after the unlink.
If that fails, we would _definitely_ need to error out.
Is it worth the trouble though?

Side note: all this logic about not carrying on after fsync failure
implies that we `sync` the filesystem successfully before we restart
the pageserver. We don't do that right now, but should (=>
#6989)

part of #6663
  • Loading branch information
problame authored Mar 4, 2024
1 parent 6e46204 commit c861d71
Showing 1 changed file with 18 additions and 4 deletions.
22 changes: 18 additions & 4 deletions pageserver/src/tenant/timeline.rs
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,6 @@ use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::sync::gate::{Gate, GateGuard};

use crate::pgdatadir_mapping::{AuxFilesDirectory, DirectoryKind};
use crate::tenant::timeline::logical_size::CurrentLogicalSize;
use crate::tenant::{
layer_map::{LayerMap, SearchResult},
Expand All @@ -75,6 +74,10 @@ use crate::{
disk_usage_eviction_task::EvictionCandidate, tenant::storage_layer::delta_layer::DeltaEntry,
};
use crate::{pgdatadir_mapping::LsnForTimestamp, tenant::tasks::BackgroundLoopKind};
use crate::{
pgdatadir_mapping::{AuxFilesDirectory, DirectoryKind},
virtual_file::MaybeFatalIo,
};

use crate::config::PageServerConf;
use crate::keyspace::{KeyPartitioning, KeySpace};
Expand Down Expand Up @@ -3426,10 +3429,14 @@ impl Timeline {
// The write_to_disk() above calls writer.finish() which already did the fsync of the inodes.
// We just need to fsync the directory in which these inodes are linked,
// which we know to be the timeline directory.
//
// We use fatal_err() below because the after write_to_disk returns with success,
// the in-memory state of the filesystem already has the layer file in its final place,
// and subsequent pageserver code could think it's durable while it really isn't.
par_fsync::par_fsync(&[self_clone
.conf
.timeline_path(&self_clone.tenant_shard_id, &self_clone.timeline_id)])
.context("fsync of timeline dir")?;
.fatal_err("fsync of timeline dir");

anyhow::Ok(new_delta)
}
Expand Down Expand Up @@ -3662,11 +3669,14 @@ impl Timeline {
// We just need to fsync the directory in which these inodes are linked,
// which we know to be the timeline directory.
if !image_layers.is_empty() {
// We use fatal_err() below because the after writer.finish() returns with success,
// the in-memory state of the filesystem already has the layer file in its final place,
// and subsequent pageserver code could think it's durable while it really isn't.
par_fsync::par_fsync_async(&[self
.conf
.timeline_path(&self.tenant_shard_id, &self.timeline_id)])
.await
.context("fsync of timeline dir")?;
.fatal_err("fsync of timeline dir");
}

let mut guard = self.layers.write().await;
Expand Down Expand Up @@ -4251,12 +4261,16 @@ impl Timeline {
// The writer.finish() above already did the fsync of the inodes.
// We just need to fsync the directory in which these inodes are linked,
// which we know to be the timeline directory.
//
// We use fatal_err() below because the after writer.finish() returns with success,
// the in-memory state of the filesystem already has the layer file in its final place,
// and subsequent pageserver code could think it's durable while it really isn't.
let timeline_dir = self
.conf
.timeline_path(&self.tenant_shard_id, &self.timeline_id);
par_fsync::par_fsync_async(&[timeline_dir])
.await
.context("fsync of timeline dir")?;
.fatal_err("fsync of timeline dir");
}

stats.write_layer_files_micros = stats.read_lock_drop_micros.till_now();
Expand Down

1 comment on commit c861d71

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2561 tests run: 2426 passed, 1 failed, 134 skipped (full report)


Failures on Postgres 14

  • test_basebackup_with_high_slru_count[github-actions-selfhosted-vectored-10-13-30]: release
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_basebackup_with_high_slru_count[release-pg14-github-actions-selfhosted-vectored-10-13-30]"
Flaky tests (2)

Postgres 16

  • test_remote_storage_upload_queue_retries: debug
  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 28.7% (6937 of 24179 functions)
  • lines: 47.2% (42536 of 90144 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
c861d71 at 2024-03-04T13:25:22.254Z :recycle:

Please sign in to comment.