storcon: do az aware scheduling #9083

Merged: 7 commits into main on Sep 25, 2024
Conversation

@VladLazar (Contributor) commented Sep 20, 2024

Problem

The storage controller didn't previously consider AZ locality between compute and pageservers
when scheduling tenant shards onto nodes. The control plane has this feature, and, since we are
migrating tenants away from it, we need feature parity to avoid performance degradations.

Summary of changes

The change itself is fairly simple:

  1. Thread AZ info into the scheduler (a rough sketch of this plumbing follows the list)
  2. Add an extra member to the scheduling scores
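
To make step (1) concrete, here is a minimal sketch of how AZ information might be threaded through. The type and field names below are hypothetical stand-ins, not the actual definitions in storage_controller/src/scheduler.rs.

```rust
// Hypothetical sketch only: names are stand-ins, not the real scheduler types.
#[derive(Clone, PartialEq, Eq)]
struct AvailabilityZone(String);

struct SchedulerNode {
    az: AvailabilityZone,
    // ... utilisation, shard counts, etc.
}

struct ScheduleContext {
    /// The AZ the tenant's compute is expected to run in, if known.
    preferred_az: Option<AvailabilityZone>,
}

/// Returns true (worse) when a preferred AZ is set and the node is outside it.
fn az_mismatch(node: &SchedulerNode, ctx: &ScheduleContext) -> bool {
    ctx.preferred_az
        .as_ref()
        .map(|az| az != &node.az)
        .unwrap_or(false)
}
```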

Step (2) deserves some more discussion. Let's break it down by the shard type being scheduled:

Attached Shards

We wish for the attached shards of a tenant to end up in the tenant's preferred AZ, since that
is where the compute is likely to be.

The AZ member of NodeAttachmentSchedulingScore has been placed below the affinity score, so it has
the second-highest weight when picking a node. The rationale for placing it below the affinity score
is to avoid all shards of a single tenant landing on the same node in two-node regions, since that
would let one tenant dominate the workload of an entire pageserver.
I'm not 100% sure this is the right decision, so I'm open to discussing hoisting the AZ member up to first place.
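
To illustrate how that weighting falls out of the field ordering, here is a minimal sketch (field names are hypothetical; the real struct lives in storage_controller/src/scheduler.rs). With a derived Ord, fields are compared lexicographically in declaration order, so placing the AZ flag second gives it the second-highest weight.

```rust
/// Stand-in for the real affinity score; lower is better.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct AffinityScore(u32);

/// Lower compares as "better". `#[derive(Ord)]` compares fields in
/// declaration order, so the order below is the weighting.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct NodeAttachmentSchedulingScore {
    /// Highest weight: spread one tenant's shards across nodes.
    affinity_score: AffinityScore,
    /// Second-highest weight: false when the node is in the tenant's
    /// preferred AZ, true otherwise.
    az_mismatch: bool,
    /// Remaining tie-breakers (utilisation, attached shard count, ...).
    attached_shard_count: usize,
}

/// The scheduler would pick the node with the minimum score.
fn pick_best(
    candidates: Vec<NodeAttachmentSchedulingScore>,
) -> Option<NodeAttachmentSchedulingScore> {
    candidates.into_iter().min()
}
```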

Secondary Shards

We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one
for HA purposes.

The AZ member for NodeSecondarySchedulingScore has been placed first, so nodes in AZs other than
the preferred one will always be considered first. On small clusters this can mean that all the secondaries
of a tenant are scheduled to the same pageserver, but secondaries don't consume as many resources as the
attached location, so in my opinion the argument made for attached shards doesn't apply here.
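
Under the same assumptions as the previous sketch, the secondary score would place the AZ field first, so any node outside the preferred AZ beats any node inside it regardless of the remaining fields:

```rust
/// Stand-in for the real affinity score; lower is better.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct AffinityScore(u32);

/// Lower compares as "better"; field order is again the weighting.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct NodeSecondarySchedulingScore {
    /// Highest weight: false when the node is *outside* the preferred AZ,
    /// so secondaries land in a different AZ from the attached location.
    same_az_as_preferred: bool,
    /// Then the usual spreading criteria.
    affinity_score: AffinityScore,
    total_secondary_count: usize,
}
```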

Related: #8848

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist

github-actions bot commented Sep 20, 2024

5012 tests run: 4847 passed, 1 failed, 164 skipped (full report)


Failures on Postgres 16

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_storage_controller_heartbeats[release-pg16-failure4]"
Flaky tests (5): Postgres 17, Postgres 16

Code coverage* (full report)

  • functions: 32.2% (7494 of 23275 functions)
  • lines: 50.1% (60478 of 120819 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results: ab411e3 at 2024-09-25T13:07:42.384Z

@VladLazar force-pushed the vlad/storcon-az-aware-scheduling branch from 9a57eec to 7f6dab5 on September 23, 2024 14:49
@VladLazar changed the base branch from main to vlad/storcon-improve-initial-scheduling on September 23, 2024 14:49
@VladLazar changed the title from "storcon(wip): do az aware scheduling" to "storcon: do az aware scheduling" on Sep 23, 2024
@VladLazar marked this pull request as ready for review on September 23, 2024 15:04
@VladLazar requested a review from a team as a code owner on September 23, 2024 15:04
@VladLazar requested review from problame and jcsp, and removed the request for a team and problame, on September 23, 2024 15:04
Base automatically changed from vlad/storcon-improve-initial-scheduling to main on September 24, 2024 09:03
@VladLazar force-pushed the vlad/storcon-az-aware-scheduling branch from 7f6dab5 to 6d0679d on September 24, 2024 10:53
@jcsp (Collaborator) commented Sep 24, 2024

> The AZ member of NodeAttachmentSchedulingScore has been placed below the affinity score [...] I'm not 100% sure this is the right decision, so I'm open to discussing hoisting the AZ member up to first place.

I agree with this in the near term.

The consequence will be that for 8-sharded tenants, until we have 24 pageservers in a region, they'll get spread across AZs rather than concentrated in one (this is not a regression, just calling out the behavior). We may well want to evolve this in a few months in #8264, to do something like clamping affinity scores to permit a small number of a tenant's shards to co-exist on the same pageserver, so that we can get total AZ locality as soon as we have 3-4 pageservers per region.
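
Purely as an illustration of the clamping idea (not part of this PR; the names and the exact formulation are made up), the per-node affinity penalty could ignore the first few co-located shards of a tenant, so that AZ locality wins until that allowance is exceeded:

```rust
/// Illustrative only: ignore the affinity penalty for up to
/// `allowed_colocated` shards of the same tenant on one node, so that
/// AZ preference dominates until a node holds more than that many shards.
fn clamped_affinity_penalty(shards_of_tenant_on_node: u32, allowed_colocated: u32) -> u32 {
    shards_of_tenant_on_node.saturating_sub(allowed_colocated)
}
```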

Review comments (since resolved) were left on storage_controller/src/scheduler.rs and storage_controller/src/service.rs.
@VladLazar (Contributor, author)

Indeed. This PR mostly targets single-sharded tenants, so that they don't experience a degradation when migrated
away from cplane. 100% this can evolve into something that handles many-sharded tenants in a nicer way.

@VladLazar merged commit 2cf47b1 into main on Sep 25, 2024
80 checks passed
@VladLazar deleted the vlad/storcon-az-aware-scheduling branch on September 25, 2024 13:31