storcon: do az aware scheduling #9083
Conversation
5012 tests run: 4847 passed, 1 failed, 164 skipped (full report)

Failures on Postgres 16

Flaky tests (5): Postgres 17, Postgres 16

Code coverage* (full report)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.

ab411e3 at 2024-09-25T13:07:42.384Z :recycle:
Force-pushed from 9a57eec to 7f6dab5

Force-pushed from 7f6dab5 to 6d0679d
I agree with this in the near term. The consequence will be that for 8-sharded tenants, until we have 24 pageservers in a region, they'll get spread across AZs rather than concentrated in one (this is not a regression, just calling out the behavior). We may well want to evolve this in a few months in #8264, to do something like clamping affinity scores to permit a small number of a tenant's shards to co-exist on the same pageserver, so that we can get total AZ locality as soon as we have 3-4 pageservers per region.
Indeed. This PR mostly targets single-sharded tenants, so that they don't experience degradation when migrated.
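For illustration, one way the affinity clamping idea from the comment above could look (a hedged sketch only; the `max_colocated` knob and helper are hypothetical and not part of this PR or #8264):

```rust
// Hypothetical sketch: ignore the first `max_colocated` co-located shards of a
// tenant, so that a small number of its shards may share a pageserver before
// affinity starts pushing further shards away (and out of the preferred AZ).
fn clamped_affinity(tenant_shards_on_node: usize, max_colocated: usize) -> usize {
    tenant_shards_on_node.saturating_sub(max_colocated)
}
```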
Problem
The storage controller did not previously consider AZ locality between the compute and pageservers
when scheduling tenant shards onto nodes. The control plane has this feature, and, since we are migrating tenants
away from it, we need feature parity to avoid performance degradations.
Summary of changes
The change itself is fairly simple:
Step (2) deserves some more discussion. Let's break it down by the shard type being scheduled:
Attached Shards
We wish for attached shards of a tenant to end up in the tenant's preferred AZ, since that
is where the compute is likely to be.
The AZ member of NodeAttachmentSchedulingScore has been placed below the affinity score
(so it carries the second-biggest weight when picking a node). The rationale for placing it
below the affinity score is to avoid having all shards of a single tenant placed on the same node in 2-node
regions, since that would mean a single tenant could drive the bulk of the workload of an entire pageserver.
I'm not 100% sure this is the right decision, so I'm open to discussing hoisting the AZ member up to first place.
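To make the weighting concrete, here is a minimal sketch of how field order drives the decision (the field names and types are illustrative assumptions, not the actual storcon definitions): deriving `Ord` on a struct compares fields lexicographically in declaration order, so the scheduler, which picks the lowest score, weighs earlier fields more heavily.

```rust
// Illustrative sketch only; field names/types are assumptions, not the real
// NodeAttachmentSchedulingScore. Derived Ord compares fields in declaration
// order, so earlier fields dominate when the scheduler picks the minimum.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct NodeAttachmentSchedulingScore {
    /// Shards of the same tenant already on this node. Placed first, so a
    /// tenant's shards still spread across nodes even in 2-node regions.
    affinity_score: usize,
    /// 0 if the node is in the tenant's preferred AZ, 1 otherwise. Second,
    /// so AZ locality decides among nodes with equal affinity.
    az_mismatch: u8,
    /// Node utilization breaks any remaining ties.
    utilization: u64,
}

fn pick_attached_node(
    scores: Vec<NodeAttachmentSchedulingScore>,
) -> Option<NodeAttachmentSchedulingScore> {
    // The scheduler prefers the lowest score.
    scores.into_iter().min()
}
```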
Secondary Shards
We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one
for HA purposes.
The AZ member of NodeSecondarySchedulingScore has been placed first, so nodes in AZs different
from the preferred one will always be considered first. On small clusters, this can mean that all the secondaries
of a tenant are scheduled to the same pageserver, but secondaries don't use up as many resources as the
attached location, so IMO the argument made for attached shards doesn't hold here.
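Again as a hedged sketch (names are illustrative, not the real type): putting the AZ field first in the derived ordering means any node outside the preferred AZ beats any node inside it, regardless of affinity or utilization.

```rust
// Illustrative sketch only; not the real NodeSecondarySchedulingScore.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct NodeSecondarySchedulingScore {
    /// 0 if the node is OUTSIDE the tenant's preferred AZ, 1 if inside.
    /// Placed first, so secondaries land in a different AZ whenever possible.
    az_match: u8,
    /// Tenant affinity and utilization only break ties afterwards.
    affinity_score: usize,
    utilization: u64,
}
```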
Related: #8848