
logstore: sideloaded storage is not atomic #136416

Open
pav-kv opened this issue Nov 29, 2024 · 1 comment
Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
branch-master: Failures and bugs on the master branch.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv: KV Team

Comments

@pav-kv
Collaborator

pav-kv commented Nov 29, 2024

Background

Raft log storage is mostly contained in Pebble, which provides an atomicity guarantee for writes: each write batch is either fully applied or fully discarded. For example, if the node crashes before a write has been flushed/synced, the write is fully discarded after the restart.
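A minimal sketch of this guarantee with the public github.com/cockroachdb/pebble API (the keys and values here are made up for illustration):

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// All writes in a batch commit atomically: after a crash, either
	// both keys are present or neither is.
	b := db.NewBatch()
	_ = b.Set([]byte("raft/entry/1"), []byte("payload-1"), nil)
	_ = b.Set([]byte("raft/entry/2"), []byte("payload-2"), nil)
	// pebble.Sync makes the commit durable before Commit returns. With
	// pebble.NoSync, a crash may discard the batch, but never applies
	// it partially.
	if err := b.Commit(pebble.Sync); err != nil {
		log.Fatal(err)
	}
}
```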

There is a specialized part of the raft log storage: the sideloaded storage. It is used for storing AddSSTable raft entries, which are typically large, on the order of tens of MiB. The motivation is to avoid Pebble's write amplification: AddSSTable commands are stored directly as files and "referenced" from the raft log entries.
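As a rough sketch of the scheme (the i<index>.t<term> file naming follows the index/term description later in this issue; the directory layout and helpers are hypothetical, not the actual logstore code):

```go
package sideload

import (
	"fmt"
	"os"
	"path/filepath"
)

// filename maps a raft entry to its sideloaded file: one file per
// (index, term) pair, so the raft entry kept in Pebble holds only a
// small reference instead of the multi-MiB payload.
func filename(dir string, index, term uint64) string {
	return filepath.Join(dir, fmt.Sprintf("i%d.t%d", index, term))
}

// writePayload stores one AddSSTable payload as a plain file,
// bypassing Pebble (and its write amplification) entirely.
func writePayload(dir string, index, term uint64, payload []byte) error {
	return os.WriteFile(filename(dir, index, term), payload, 0644)
}
```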

The implementation of the sideloaded storage does not provide atomicity guarantees the way Pebble does:

  1. A file can be partially written (e.g. if interrupted by a crash).
  2. When multiple files are written in one "batch", a crash can interrupt the batch so that it completes only partially.
  3. When entries are overwritten (e.g. when a new leader regresses a follower's log), multiple entries at the same log index can be present simultaneously.
  4. Writes are not atomic with the corresponding Pebble writes, so files can be written but end up "dangling" if there is a crash before the subsequent Pebble write.

This has historically caused bugs like #38566 and #113135. The workaround for the lack of atomicity is to sync newly added files before committing "references" to them in Pebble, and to sync the truncated state changes in Pebble (i.e. "unreference" the files) before removing the files.
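A hypothetical sketch of this ordering (illustrative, not the actual logstore code; a real implementation would also sync the parent directory after creating the file):

```go
package sideload

import (
	"os"

	"github.com/cockroachdb/pebble"
)

// writeDurable writes a payload to path and syncs it, so the contents
// survive a crash once this returns.
func writeDurable(path string, payload []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(payload); err != nil {
		return err
	}
	return f.Sync()
}

// addSideloaded orders the writes as described above: the file is made
// durable *before* Pebble commits the reference to it. A crash in
// between leaves a dangling file (harmless) rather than a reference to
// a missing file. Truncation runs in the opposite order: sync the new
// truncated state in Pebble first, then delete the unreferenced files.
func addSideloaded(db *pebble.DB, path string, payload, refKey, refValue []byte) error {
	if err := writeDurable(path, payload); err != nil {
		return err
	}
	b := db.NewBatch()
	if err := b.Set(refKey, refValue, nil); err != nil {
		return err
	}
	return b.Commit(pebble.Sync)
}
```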

Issues

Log size tracking is imprecise.

The log storage size delta computations are sensitive to these partial writes. The raftLogSizeTrusted field aims to catch some of the situations in which the size might be imprecise, but it doesn't cover all corner cases.

For example, if there is a crash during a TruncateTo call (after the new truncated state has already been durably applied), the next truncation after restart will observe (and account for) more entries than the actual delta between the old and new truncated state, and we will not notice the imprecision.

If there is a crash in situation (3), when a leader overwrites a follower's entries, we can end up with multiple entries at the same log index (but different terms). This is fine/correct w.r.t. raft, since entries are read back by the index/term pair in the file name. But the TruncateTo call (and the other raft log size recomputation functions) will count them all, because they filter files only by index.
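A hypothetical sketch of the index-only filtering (illustrative, not the actual recomputation code):

```go
package sideload

import (
	"os"
	"regexp"
	"strconv"
)

var fileRE = regexp.MustCompile(`^i(\d+)\.t(\d+)$`)

// sizeBelow sums the sizes of all sideloaded files with index below
// truncIndex, filtering by index only. If a crash in situation (3)
// left both i5.t2 and i5.t3 behind, both are counted, inflating the
// computed size delta.
func sizeBelow(dir string, truncIndex uint64) (int64, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	var total int64
	for _, e := range entries {
		m := fileRE.FindStringSubmatch(e.Name())
		if m == nil {
			continue
		}
		index, err := strconv.ParseUint(m[1], 10, 64)
		if err != nil || index >= truncIndex {
			continue
		}
		info, err := e.Info()
		if err != nil {
			return 0, err
		}
		total += info.Size() // the term (m[2]) is never consulted
	}
	return total, nil
}
```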

Entry removals are imprecise.

When removing sideloaded entries in case (3), the code assumes that all these entries have the same term. This is generally not true, so some files can be left dangling until some other TruncateTo call removes them as a drive-by.
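A hypothetical sketch of why this leaks files (illustrative, not the actual removal code):

```go
package sideload

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeOverwritten deletes the sideloaded files for indices in
// [from, to), assuming a single term for all of them. A file at the
// same index but a different term is never touched, so it stays
// dangling until a later TruncateTo removes it as a drive-by.
func removeOverwritten(dir string, from, to, term uint64) error {
	for index := from; index < to; index++ {
		path := filepath.Join(dir, fmt.Sprintf("i%d.t%d", index, term))
		if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	return nil
}
```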

Jira issue: CRDB-45026

pav-kv added the C-bug, A-kv-replication, and T-kv labels Nov 29, 2024

blathers-crl bot commented Nov 29, 2024

Hi @pav-kv, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

pav-kv added the branch-master label Nov 29, 2024