logstore: sideloaded storage is not atomic #136416

pav-kv · 2024-11-29T20:14:30Z

Background

Raft log storage is mostly contained in Pebble, which provides atomicity guarantee for writes: each write batch is either fully applied or fully discarded. For example, if the node crashes before a write has been flushed/synced, the write will be fully discarded post restart.

There is a specialized part of the raft log storage: sideloaded storage. It is used for storing AddSSTable raft entries, which are typically large, in the order of tens of MiB. The motivation there is avoiding the write amplification of Pebble - AddSSTable commands are stored directly as files, and "referenced" from the raft log entries.

The implementation of the sideloaded storage does not provide atomicity guarantees like Pebble:

A file can be partially written (e.g. if interrupted by a crash).
When multiple files are written in one "batch", it can be interrupted (by a crash) and complete only partially.
When entries are overwritten (e.g. when a new leader regresses a follower's log), multiple entries at the same log index can be present simultaneously.
Writes are not atomic with the corresponding Pebble writes, so files can be written, but end up "dangling" if there is a crash before the subsequent Pebble write.

This historically caused bugs like #38566 and #113135. The workaround for this lack of atomicity is to sync newly added files before committing "references" to them in Pebble, and to sync the truncated state changes in Pebble (i.e. "unreference" the files) before removing the files.

Issues

Log size tracking is imprecise.

The log storage size delta computations are sensitive to these partial writes. The raftLogSizeTrusted field aims to catch some situations when the size might be imprecise, but it doesn't consider all corner cases.

For example, if there is a crash during TruncateTo call (after we have already durably applied the new truncated state), the next truncation post restart will observe (and account) more entries than the actual delta between the old and new truncated state. But we will not notice this impreciseness.

If there is a crash during a situation (3) when leader overwrites a follower's entries, we can end up with multiple entries at the same log index (but different terms). This is fine/correct w.r.t. raft (we read entries by index/term pair in the file name). But the TruncateTo call (and other raft log size recomputation funcs) will count them all (because it filters files only by index).

Entry removals are imprecise.

When removing sideloaded entries in case (3), the code assumes that all these entries have the same term. This is generally not true, so some files can be left dangling until some other TruncateTo call removes them as a drive-by.

Jira issue: CRDB-45026

The text was updated successfully, but these errors were encountered:

blathers-crl · 2024-11-29T20:14:32Z

Hi @pav-kv, please add branch-* labels to identify which branch(es) this C-bug affects.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv KV Team labels Nov 29, 2024

pav-kv added the branch-master Failures and bugs on the master branch. label Nov 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

logstore: sideloaded storage is not atomic #136416

logstore: sideloaded storage is not atomic #136416

pav-kv commented Nov 29, 2024 •

edited

Loading

blathers-crl bot commented Nov 29, 2024

logstore: sideloaded storage is not atomic #136416

logstore: sideloaded storage is not atomic #136416

Comments

pav-kv commented Nov 29, 2024 • edited Loading

Background

Issues

Log size tracking is imprecise.

Entry removals are imprecise.

blathers-crl bot commented Nov 29, 2024

pav-kv commented Nov 29, 2024 •

edited

Loading