pageserver: only store SLRUs & aux files on shard zero #9786

Merged 10 commits on Dec 3, 2024


jcsp

@jcsp jcsp commented Nov 18, 2024

Problem

Since #9423 the non-zero shards no longer need SLRU content in order to do GC. This data is now redundant on shards >0.

One release cycle after merging that PR, we may merge this one, which also stops writing those pages to shards > 0, reaping the efficiency benefit.

Closes: #7512
Closes: #9641

Summary of changes

  • Avoid storing SLRUs on non-zero shards
  • Bonus: avoid storing aux files on non-zero shards
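
The storage rule in the summary above can be pictured as a small predicate. This is a minimal sketch and not the pageserver's actual code: `KeyKind`, `ShardNumber`, `should_store`, and the `owns_block` flag are hypothetical names introduced only for illustration.

```rust
/// Hypothetical classification of keys, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KeyKind {
    /// Relation data block, distributed across shards.
    RelBlock,
    /// SLRU block, which non-zero shards no longer need for GC.
    SlruBlock,
    /// Aux file content.
    AuxFile,
}

/// Hypothetical shard identifier within a sharded tenant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ShardNumber(u8);

impl ShardNumber {
    fn is_zero(self) -> bool {
        self.0 == 0
    }
}

/// Returns true if `shard` should persist a key of the given kind.
/// `owns_block` is an assumed stand-in for whatever logic maps a
/// relation block to the shard that owns it.
fn should_store(shard: ShardNumber, kind: KeyKind, owns_block: bool) -> bool {
    match kind {
        // SLRUs and aux files live on shard zero only.
        KeyKind::SlruBlock | KeyKind::AuxFile => shard.is_zero(),
        // Relation blocks follow normal ownership.
        KeyKind::RelBlock => owns_block,
    }
}
```

Under these assumptions, `should_store(ShardNumber(3), KeyKind::SlruBlock, false)` is false, while the same call with `ShardNumber(0)` is true.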

@jcsp jcsp changed the title from "Jcsp/slrus on shard 0 pt2" to "pageserver: only store SLRUs on shard zero" Nov 18, 2024

github-actions bot commented Nov 18, 2024

7018 tests run: 6710 passed, 0 failed, 308 skipped (full report)


Flaky tests (7): Postgres 17, Postgres 16, Postgres 15, Postgres 14

Code coverage* (full report)

  • functions: 30.4% (8273 of 27230 functions)
  • lines: 47.7% (65232 of 136621 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
6875f94 at 2024-12-02T18:09:16.385Z :recycle:

github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2024
## Problem

SLRU blocks, which can add up to several gigabytes, are currently
ingested by all shards, multiplying their capacity cost by the shard
count and slowing down ingest. We do this because all shards need the
SLRU pages to do timestamp->LSN lookup for GC.

Related: #7512

## Summary of changes

- On non-zero shards, learn the GC offset from shard 0's index instead
of calculating it.
- Add a test `test_sharding_gc` that exercises this
- Do GC in test_pg_regress as a general smoke test that GC functions run
(e.g. this would fail if we were using SLRUs we didn't have)

In this PR we are still ingesting SLRUs everywhere, but not using them
any more. Part 2 PR (#9786)
makes the change to not store them at all.
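
The division of labour described above (shard 0 calculates the GC offset, other shards learn it) might look roughly like the following. This is a sketch under stated assumptions, not the code from #9423: `Lsn`, `GcInputs`, and `gc_cutoff` are hypothetical, and treating a missing value as "no cutoff this round" is an assumption made for the example.

```rust
/// Hypothetical log sequence number wrapper.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

/// Hypothetical inputs for choosing a shard's GC cutoff.
struct GcInputs {
    shard_number: u8,
    /// Result of a timestamp->LSN lookup over local SLRU pages.
    /// Only shard 0 is expected to be able to produce this.
    local_lookup: Option<Lsn>,
    /// Cutoff published by shard 0 (for example through its index),
    /// if one is visible yet.
    shard_zero_cutoff: Option<Lsn>,
}

/// Picks the GC cutoff for a shard. `None` means no cutoff is known;
/// what a caller does in that case is outside this sketch.
fn gc_cutoff(inputs: &GcInputs) -> Option<Lsn> {
    if inputs.shard_number == 0 {
        // Shard 0 still holds SLRU pages and calculates the value itself.
        inputs.local_lookup
    } else {
        // Non-zero shards never consult SLRU pages; they reuse shard 0's value.
        inputs.shard_zero_cutoff
    }
}
```

The point of the split is that a non-zero shard's result no longer depends on any SLRU data it holds, which is what later allows that data to be dropped.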

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
@jcsp jcsp force-pushed the jcsp/slrus-on-shard-0-pt2 branch from 001341a to 636e702 on November 20, 2024 16:19
@jcsp jcsp changed the title from "pageserver: only store SLRUs on shard zero" to "pageserver: only store SLRUs & aux files on shard zero" Nov 20, 2024

@VladLazar VladLazar left a comment


Generally looks good. Unsure about the checkpoint change though.

pageserver/src/walingest.rs (outdated, resolved)

jcsp commented Nov 28, 2024

Rebased on main to de-conflict with decoder.rs changes.

Assuming it's all good, let's merge this on Monday after the release is cut, so that we get a full week to bake in staging.

@jcsp jcsp force-pushed the jcsp/slrus-on-shard-0-pt2 branch from 3331d68 to 4e22e26 on November 28, 2024 10:07
@jcsp jcsp marked this pull request as ready for review November 28, 2024 10:07
@jcsp jcsp requested a review from a team as a code owner November 28, 2024 10:07
@jcsp jcsp requested a review from erikgrinaker November 28, 2024 10:07
@jcsp jcsp added the t/feature (Issue type: feature, for new features or requests) and c/storage/pageserver (Component: storage: pageserver) labels Nov 28, 2024

@erikgrinaker erikgrinaker left a comment


LGTM. I convinced myself in #9423 that this is ok.

libs/pageserver_api/src/shard.rs (resolved)
libs/pageserver_api/src/shard.rs (resolved)
libs/wal_decoder/src/decoder.rs (outdated, resolved)
pageserver/src/pgdatadir_mapping.rs (resolved)
@jcsp jcsp enabled auto-merge December 2, 2024 16:56
@jcsp jcsp added this pull request to the merge queue Dec 3, 2024
Merged via the queue into main with commit dcb6295 Dec 3, 2024
80 checks passed
@jcsp jcsp deleted the jcsp/slrus-on-shard-0-pt2 branch December 3, 2024 17:24
awarus pushed a commit that referenced this pull request Dec 5, 2024
github-merge-queue bot pushed a commit that referenced this pull request Dec 11, 2024
## Problem

In #9786 we stop storing SLRUs on non-zero shards.

However, one code path during ingest still tried to enumerate SLRU
relations on all shards. This fails for a tenant that has never seen any
write to an SLRU, or that has done such thorough compaction+GC that it
has dropped its SLRU directory key.

## Summary of changes

- Avoid trying to list SLRU relations on nonzero shards
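
The failure mode and the guard can be sketched as follows. This is illustrative only and is not the actual fix: `ShardState`, `ListError`, and `list_slru_segments` are hypothetical names, and returning an empty listing on non-zero shards is an assumption about one way to avoid the lookup.

```rust
use std::collections::BTreeSet;

/// Hypothetical view of one shard's SLRU directory.
struct ShardState {
    shard_number: u8,
    /// `None` models a missing SLRU directory key, e.g. never written
    /// or dropped by compaction+GC.
    slru_dir: Option<BTreeSet<u32>>,
}

/// Hypothetical error for a lookup that finds no directory key.
#[derive(Debug, PartialEq)]
enum ListError {
    MissingDirectoryKey,
}

/// Lists SLRU segment numbers, skipping the directory lookup on
/// non-zero shards.
fn list_slru_segments(state: &ShardState) -> Result<BTreeSet<u32>, ListError> {
    if state.shard_number != 0 {
        // Non-zero shards store no SLRUs, so there is nothing to enumerate
        // and the directory key may legitimately be absent.
        return Ok(BTreeSet::new());
    }
    state.slru_dir.clone().ok_or(ListError::MissingDirectoryKey)
}
```

Without the early return, a non-zero shard whose directory key is absent would take the error path even though nothing is wrong with it.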
github-merge-queue bot pushed a commit that referenced this pull request Dec 16, 2024
## Problem

Changes in #9786 were functionally complete but missed some edges that
made testing less robust than it should have been:
- `is_key_disposable` didn't consider SLRU dir keys disposable
- Timeline `init_empty` was always creating SLRU dir keys on all shards

The result was that when we had a bug
(#10080), it wasn't apparent in
tests, because one would only encounter the issue if running on a
long-lived timeline with enough compaction to drop the initially created
empty SLRU dir keys, _and_ some CLog truncation going on.

Closes: neondatabase/cloud#21516

## Summary of changes

- Update `is_key_global` and `init_empty` to handle SLRU dir keys properly
-- the only functional impact is that we avoid writing some spurious
keys in shards >0, but this makes testing much more robust.
- Make `test_clog_truncate` explicitly use a sharded tenant

The net result is that if one reverts #10080, then tests fail (i.e. this
PR is a reproducer for the issue)
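
How the two pieces fit together can be sketched like this. The function names echo those mentioned above, but the bodies are hypothetical simplifications and not the real implementations; the `Key` enum and the assumption that `DbDir` is kept on every shard are introduced only for the example.

```rust
/// Hypothetical, heavily simplified key space.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Key {
    /// Assumed to be kept on every shard.
    DbDir,
    /// SLRU directory key, needed on shard zero only.
    SlruDir,
    /// Relation block owned by one specific shard.
    RelBlock { owner_shard: u8 },
}

/// Sketch: true if this shard has no need to retain the key.
fn is_key_disposable(shard_number: u8, key: Key) -> bool {
    match key {
        Key::DbDir => false,
        Key::SlruDir => shard_number != 0,
        Key::RelBlock { owner_shard } => owner_shard != shard_number,
    }
}

/// Sketch: keys written when an empty timeline is created on a shard.
/// Filtering through `is_key_disposable` keeps both functions consistent.
fn init_empty_keys(shard_number: u8) -> Vec<Key> {
    vec![Key::DbDir, Key::SlruDir]
        .into_iter()
        .filter(|k| !is_key_disposable(shard_number, *k))
        .collect()
}
```

With this shape, a fresh non-zero shard starts without an SLRU dir key, so a test on a new sharded timeline hits the same state that previously only arose after long-running compaction.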
Labels
c/storage/pageserver Component: storage: pageserver t/feature Issue type: feature, for new features or requests
3 participants