logstore: sideloaded storage is not atomic #136416
Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
branch-master: Failures and bugs on the master branch.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv: KV Team
Background
Raft log storage is mostly contained in Pebble, which provides an atomicity guarantee for writes: each write batch is either fully applied or fully discarded. For example, if the node crashes before a write has been flushed/synced, the write will be fully discarded after restart.
There is a specialized part of the raft log storage: sideloaded storage. It is used for storing AddSSTable raft entries, which are typically large, on the order of tens of MiB. The motivation is to avoid the write amplification of Pebble: AddSSTable commands are stored directly as files, and "referenced" from the raft log entries.
The implementation of the sideloaded storage does not provide atomicity guarantees like Pebble:
This historically caused bugs like #38566 and #113135. The workaround for this lack of atomicity is to sync newly added files before committing "references" to them in Pebble, and to sync the truncated state changes in Pebble (i.e. "unreference" the files) before removing the files.
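The ordering in this workaround can be sketched as follows. This is a minimal illustration, not the CockroachDB implementation: `refStore`, `commitSynced`, `addSideloaded`, and `removeSideloaded` are hypothetical names, an in-memory map stands in for the synced Pebble state, and the file name `i15.t3` is only an example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// refStore is a hypothetical stand-in for the Pebble-backed raft log state.
// It only tracks which sideloaded files are currently referenced.
type refStore struct {
	refs map[string]bool
}

// commitSynced models an atomic, synced write batch (assumption: in the real
// system this would be a durable Pebble commit).
func (s *refStore) commitSynced(update func(refs map[string]bool)) {
	update(s.refs)
}

// addSideloaded makes the file durable first, then commits the reference.
// A crash in between leaves an unreferenced file, never a dangling reference.
func addSideloaded(s *refStore, dir, name string, payload []byte) error {
	f, err := os.Create(filepath.Join(dir, name))
	if err != nil {
		return err
	}
	if _, err := f.Write(payload); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // sync the file before referencing it
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	s.commitSynced(func(refs map[string]bool) { refs[name] = true })
	return nil
}

// removeSideloaded durably drops the reference first, then removes the file.
// A crash in between again leaves only an unreferenced file behind.
func removeSideloaded(s *refStore, dir, name string) error {
	s.commitSynced(func(refs map[string]bool) { delete(refs, name) })
	return os.Remove(filepath.Join(dir, name))
}

// runDemo adds and removes one file in a temporary directory. It reports
// whether the reference existed after the add, and whether the file was
// still on disk after the remove.
func runDemo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "sideloaded")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	s := &refStore{refs: map[string]bool{}}
	if err := addSideloaded(s, dir, "i15.t3", []byte("sst payload")); err != nil {
		return false, false, err
	}
	refAfterAdd := s.refs["i15.t3"]
	if err := removeSideloaded(s, dir, "i15.t3"); err != nil {
		return refAfterAdd, false, err
	}
	_, statErr := os.Stat(filepath.Join(dir, "i15.t3"))
	return refAfterAdd, statErr == nil, nil
}

func main() {
	refAfterAdd, fileLeft, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("referenced after add:", refAfterAdd)
	fmt.Println("file left after remove:", fileLeft)
}
```

In both directions a crash between the two steps can only leave an unreferenced file on disk, which is why the file system and the Pebble state can disagree after a restart.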
Issues
Log size tracking is imprecise.
The log storage size delta computations are sensitive to these partial writes. The raftLogSizeTrusted field aims to catch some situations when the size might be imprecise, but it doesn't consider all corner cases.
For example, if there is a crash during a TruncateTo call (after we have already durably applied the new truncated state), the next truncation after restart will observe (and account for) more entries than the actual delta between the old and new truncated state. But we will not notice this imprecision.
If there is a crash during situation (3), when the leader overwrites a follower's entries, we can end up with multiple entries at the same log index (but different terms). This is fine/correct w.r.t. raft (we read entries by the index/term pair in the file name). But the TruncateTo call (and other raft log size recomputation funcs) will count them all (because it filters files only by index).
Entry removals are imprecise.
When removing sideloaded entries in case (3), the code assumes that all these entries have the same term. This is generally not true, so some files can be left dangling until some other TruncateTo call removes them as a drive-by.
Jira issue: CRDB-45026
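The two issues above can be reproduced in a small model. This is a sketch under stated assumptions, not the actual logstore code: `store`, `removeFromAssumingTerm`, and `bytesUpToIndex` are hypothetical names, the `i<index>.t<term>` file naming is assumed for illustration, and the sizes are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// fileKey identifies a sideloaded file by the raft index and term of its entry.
type fileKey struct{ index, term uint64 }

// store is a hypothetical in-memory model of the sideloaded directory,
// mapping each file to its size in bytes.
type store map[fileKey]int64

// removeFromAssumingTerm removes files with index >= from, but only those
// matching the single given term. This models the same-term assumption.
func (s store) removeFromAssumingTerm(from, term uint64) {
	for k := range s {
		if k.index >= from && k.term == term {
			delete(s, k)
		}
	}
}

// bytesUpToIndex counts files with index <= the given index. It filters by
// index only, so duplicates at different terms are all counted.
func (s store) bytesUpToIndex(index uint64) (files int, bytes int64) {
	for k, size := range s {
		if k.index <= index {
			files++
			bytes += size
		}
	}
	return files, bytes
}

// names returns the sorted file names, using the assumed naming scheme.
func (s store) names() []string {
	out := make([]string, 0, len(s))
	for k := range s {
		out = append(out, fmt.Sprintf("i%d.t%d", k.index, k.term))
	}
	sort.Strings(out)
	return out
}

// danglingDemo models a follower whose entries at indexes 5 and 6 were
// written at different terms and are then overwritten by a new leader.
func danglingDemo() (int, int64, []string) {
	s := store{
		{index: 5, term: 1}: 100,
		{index: 6, term: 2}: 200,
	}
	// The removal assumes every overwritten entry has term 2, so the file
	// for index 5 at term 1 is left dangling.
	s.removeFromAssumingTerm(5, 2)
	// The leader's replacement entry for index 5 arrives at term 3.
	s[fileKey{index: 5, term: 3}] = 300

	files, bytes := s.bytesUpToIndex(5)
	return files, bytes, s.names()
}

func main() {
	files, bytes, names := danglingDemo()
	fmt.Println("files at index <= 5:", files, "bytes:", bytes, "names:", names)
}
```

In this model the index-only count reports two files (400 bytes) at index 5, although the log holds a single entry at that index. Matching on the full index/term pair in both the removal and the size computation would avoid the mismatch in the model; whether that is the right fix for the real code is left open here.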