storcon: do az aware scheduling #9083
Conversation
5012 tests run: 4847 passed, 1 failed, 164 skipped (full report)

Failures on Postgres 16

Flaky tests (5): Postgres 17, Postgres 16

Code coverage* (full report)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.

ab411e3 at 2024-09-25T13:07:42.384Z :recycle:
Force-pushed from 9a57eec to 7f6dab5

Force-pushed from 7f6dab5 to 6d0679d
I agree with this in the near term. The consequence will be that for 8-sharded tenants, until we have 24 pageservers in a region, they'll get spread across AZs rather than concentrated in one (this is not a regression, just calling out the behavior). We may well want to evolve this in a few months in #8264, to do something like clamping affinity scores to permit a small number of a tenant's shards to co-exist on the same pageserver, so that we can get total AZ locality as soon as we have 3-4 pageservers per region.
Indeed. This PR mostly targets single-sharded tenants, so that they don't experience degradation when migrated.
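For illustration, one way the affinity clamping idea from the comment above could look (a hedged sketch only; the `max_colocated` knob and helper are hypothetical and not part of this PR or #8264):

```rust
// Hypothetical sketch: ignore the first `max_colocated` co-located shards of a
// tenant, so that a small number of its shards may share a pageserver before
// affinity starts pushing further shards away (and out of the preferred AZ).
fn clamped_affinity(tenant_shards_on_node: usize, max_colocated: usize) -> usize {
    tenant_shards_on_node.saturating_sub(max_colocated)
}
```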
Problem
The storage controller did not previously consider AZ locality between the compute and pageservers
when scheduling tenant shards onto nodes. The control plane has this feature, and, since we are migrating tenants
away from it, we need feature parity to avoid performance degradations.
Summary of changes
The change itself is fairly simple:
Step (2) deserves some more discussion. Let's break it down by the shard type being scheduled:
Attached Shards
We wish for attached shards of a tenant to end up in the tenant's preferred AZ, since that
is where the compute is likely to be.
The AZ member of NodeAttachmentSchedulingScore has been placed below the affinity score
(so it carries the second-biggest weight when picking a node). The rationale for placing it
below the affinity score is to avoid having all shards of a single tenant placed on the same node in 2-node
regions, since that would mean a single tenant could drive the bulk of the workload of an entire pageserver.
I'm not 100% sure this is the right decision, so I'm open to discussing hoisting the AZ member up to first place.
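To make the weighting concrete, here is a minimal sketch of how field order drives the decision (the field names and types are illustrative assumptions, not the actual storcon definitions): deriving `Ord` on a struct compares fields lexicographically in declaration order, so the scheduler, which picks the lowest score, weighs earlier fields more heavily.

```rust
// Illustrative sketch only; field names/types are assumptions, not the real
// NodeAttachmentSchedulingScore. Derived Ord compares fields in declaration
// order, so earlier fields dominate when the scheduler picks the minimum.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct NodeAttachmentSchedulingScore {
    /// Shards of the same tenant already on this node. Placed first, so a
    /// tenant's shards still spread across nodes even in 2-node regions.
    affinity_score: usize,
    /// 0 if the node is in the tenant's preferred AZ, 1 otherwise. Second,
    /// so AZ locality decides among nodes with equal affinity.
    az_mismatch: u8,
    /// Node utilization breaks any remaining ties.
    utilization: u64,
}

fn pick_attached_node(
    scores: Vec<NodeAttachmentSchedulingScore>,
) -> Option<NodeAttachmentSchedulingScore> {
    // The scheduler prefers the lowest score.
    scores.into_iter().min()
}
```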
Secondary Shards
We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one
for HA purposes.
The AZ member of NodeSecondarySchedulingScore has been placed first, so nodes in AZs different
from the preferred one will always be considered first. On small clusters, this can mean that all the secondaries
of a tenant are scheduled to the same pageserver, but secondaries don't use up as many resources as the
attached location, so IMO the argument made for attached shards doesn't hold here.
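Again as a hedged sketch (names are illustrative, not the real type): putting the AZ field first in the derived ordering means any node outside the preferred AZ beats any node inside it, regardless of affinity or utilization.

```rust
// Illustrative sketch only; not the real NodeSecondarySchedulingScore.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct NodeSecondarySchedulingScore {
    /// 0 if the node is OUTSIDE the tenant's preferred AZ, 1 if inside.
    /// Placed first, so secondaries land in a different AZ whenever possible.
    az_match: u8,
    /// Tenant affinity and utilization only break ties afterwards.
    affinity_score: usize,
    utilization: u64,
}
```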
Related: #8848