storcon: refine logic for choosing AZ on tenant creation #10054

Merged
merged 5 commits into from
Dec 12, 2024

Conversation

jcsp
Collaborator

@jcsp jcsp commented Dec 9, 2024

Problem

When we update our scheduler/optimization code to respect AZs properly (#9916), the choice of AZ becomes a much higher-stakes decision. We will pretty much always run a tenant in its preferred AZ, and that AZ is fixed for the lifetime of the tenant (unless a human intervenes).

Eventually, when we do auto-balancing based on utilization, I anticipate that part of it will be to automatically change the AZ of tenants if our original scheduling decisions have caused imbalance. As an interim measure, we can at least avoid making this scheduling decision based purely on which AZ contains the emptiest node.

This is a precursor to #9947.

Summary of changes

  • When creating a tenant, instead of scheduling a shard and then reading its preferred AZ back, make the AZ decision first.
  • Instead of choosing the AZ based on which node is emptiest, use the median utilization of the nodes in each AZ to pick the AZ. This avoids bad AZ decisions during periods when some node has very low utilization (such as after replacing a dead node).

I considered also making the selection a weighted pseudo-random choice based on utilization, but wanted to avoid destabilising tests with that for now.
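
To illustrate the difference between the two selection rules in the summary above, here is a minimal sketch. It is not the storage controller's actual code: `NodeInfo`, its `utilization` score (lower means emptier), and both function names are hypothetical, and the median convention (upper median for even counts) and the tie-break by AZ name are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical node summary: the AZ the node lives in and a utilization
/// score where a lower value means an emptier node.
#[derive(Debug, Clone)]
struct NodeInfo {
    az: String,
    utilization: u64,
}

/// Convenience constructor for the example.
fn node(az: &str, utilization: u64) -> NodeInfo {
    NodeInfo { az: az.to_string(), utilization }
}

/// Old rule (as described in the summary): take the AZ of the single
/// emptiest node. One freshly replaced, nearly empty node decides the result.
fn pick_az_by_emptiest_node(nodes: &[NodeInfo]) -> Option<String> {
    nodes
        .iter()
        .min_by_key(|n| n.utilization)
        .map(|n| n.az.clone())
}

/// Median of a non-empty slice (upper median for even lengths; assumption).
fn median(scores: &mut [u64]) -> u64 {
    scores.sort_unstable();
    scores[scores.len() / 2]
}

/// New rule (sketch): group nodes by AZ, compute the median utilization per
/// AZ, and pick the AZ with the lowest median. Ties are broken by AZ name so
/// the choice is deterministic.
fn pick_az_by_median(nodes: &[NodeInfo]) -> Option<String> {
    let mut by_az: BTreeMap<&str, Vec<u64>> = BTreeMap::new();
    for n in nodes {
        by_az.entry(n.az.as_str()).or_default().push(n.utilization);
    }
    by_az
        .into_iter()
        .map(|(az, mut scores)| (median(&mut scores), az))
        .min()
        .map(|(_, az)| az.to_string())
}
```

For example, with nodes `az-a: [50, 60, 1]` (the `1` being a just-replaced node) and `az-b: [30, 35, 40]`, the emptiest-node rule selects `az-a`, while the median rule compares 50 against 35 and selects `az-b`.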

@jcsp jcsp added the t/feature (Issue type: feature, for new features or requests) and c/storage/controller (Component: Storage Controller) labels Dec 9, 2024
@jcsp jcsp changed the title from "Jcsp/tenant create az allocation" to "storcon: refine logic for choosing AZ on tenant creation" Dec 9, 2024

github-actions bot commented Dec 9, 2024

7051 tests run: 6725 passed, 1 failed, 325 skipped (full report)


Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_prefetch[release-pg14-4]"
Flaky tests (4)

Postgres 16

Postgres 15

Test coverage report is not available

The comment gets automatically updated with the latest test results
ecbe8e8 at 2024-12-09T11:56:34.246Z :recycle:

@jcsp jcsp force-pushed the jcsp/tenant-create-az-allocation branch from f8e37eb to ecbe8e8 Compare December 9, 2024 10:55
@jcsp jcsp marked this pull request as ready for review December 12, 2024 10:10
@jcsp jcsp requested a review from a team as a code owner December 12, 2024 10:10
@jcsp jcsp requested review from yliang412 and VladLazar December 12, 2024 10:10
Review thread on libs/pageserver_api/src/models/utilization.rs (outdated, resolved)
@jcsp jcsp enabled auto-merge December 12, 2024 18:24
@jcsp jcsp added this pull request to the merge queue Dec 12, 2024
Merged via the queue into main with commit a93e3d3 Dec 12, 2024
79 checks passed
@jcsp jcsp deleted the jcsp/tenant-create-az-allocation branch December 12, 2024 19:36
Labels
c/storage/controller Component: Storage Controller t/feature Issue type: feature, for new features or requests

2 participants