
storcon: AZ awareness when filling nodes #9947

Closed
wants to merge 2 commits into from

Conversation

@jcsp (Collaborator) commented Nov 29, 2024

(Plan on bundling this into #9916)

Problem

When we fill a node after draining it, we should move back the shards whose home AZ matches the node being filled, rather than arbitrary shards.

Part of #8264

Summary of changes

  • While planning fills, exclude any shards that have an AZ preference that doesn't match the node being filled.

There are edge cases where this might not be ideal, such as if a system is extremely overloaded to the point that many shards are running in non-preferred AZs. However, filling nodes is an optimisation, not something that has to happen, and if a significant number of shards can't be scheduled into their preferred AZ then we have bigger problems.
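The change amounts to a filter on the fill-planning candidate list. A minimal sketch of that idea follows; the names (`AvailabilityZone`, `ShardCandidate`, `plan_fill`) are hypothetical and do not match the storage controller's real types, they only illustrate the behaviour described above.

```rust
// Sketch only: when filling a node, skip shards whose preferred AZ differs
// from the AZ of the node being filled. Shards with no AZ preference remain
// eligible. All names here are illustrative, not the real storcon types.

#[derive(Clone, Debug, PartialEq, Eq)]
struct AvailabilityZone(String);

#[derive(Clone, Debug)]
struct ShardCandidate {
    tenant_shard_id: String,
    preferred_az: Option<AvailabilityZone>,
}

/// Pick up to `fill_count` shards to move onto a node in `node_az` during a fill.
fn plan_fill(
    candidates: &[ShardCandidate],
    node_az: &AvailabilityZone,
    fill_count: usize,
) -> Vec<ShardCandidate> {
    candidates
        .iter()
        .filter(|shard| match &shard.preferred_az {
            // Only move back shards whose home AZ matches the node being filled.
            Some(az) => az == node_az,
            // No preference: still a valid candidate for the fill.
            None => true,
        })
        .take(fill_count)
        .cloned()
        .collect()
}
```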

@jcsp marked this pull request as ready for review November 29, 2024 18:21
@jcsp requested a review from a team as a code owner November 29, 2024 18:21
@jcsp requested reviews from yliang412 and VladLazar November 29, 2024 18:21
@jcsp added the labels t/feature (Issue type: feature, for new features or requests) and c/storage/controller (Component: Storage Controller) Nov 29, 2024

6952 tests run: 6644 passed, 0 failed, 308 skipped (full report)


Flaky tests (2)

  • Postgres 15
  • Postgres 14
Code coverage* (full report)

  • functions: 30.2% (8181 of 27046 functions)
  • lines: 47.7% (64849 of 135995 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
ac611a6 at 2024-11-29T19:30:50.259Z

@VladLazar (Contributor) left a comment

Code looks good, but what does this mean for regions where there's only one PS per AZ? We end up with an unbalanced cluster after restart, and the optimizations, which are not AZ-affinity aware, kick in and start moving things anyway.

@jcsp (Collaborator, Author) commented Dec 2, 2024

Hmm, wasn't the existing code going to fight the optimiser too? (it still moves shards around to fill nodes, in ways that might disagree with the optimiser)

If necessary we can combine this one with the big AZ scheduling PR #9916 (still a WIP) which lines up the optimiser with AZ awareness.

@VladLazar (Contributor)

> Hmm, wasn't the existing code going to fight the optimiser too? (it still moves shards around to fill nodes, in ways that might disagree with the optimiser)

Kind of:

  • Existing code: We migrate based on shard count without any AZ awareness. The optimiser looks at node affinity and attachments in the context of a shard. Fill logic already tries not to load the node with too many shards of the same tenant. They shouldn't disagree massively.
  • Proposed code: We end up with an empty node at the end of the rolling restart and the optimiser will fill it up with a slow trickle.

> If necessary we can combine this one with the big AZ scheduling PR #9916 (still a WIP) which lines up the optimiser with AZ awareness.

Perhaps that would be better. wdyt?

github-merge-queue bot pushed a commit that referenced this pull request Dec 12, 2024
## Problem

When we update our scheduler/optimization code to respect AZs properly
(#9916), the choice of AZ
becomes a much higher-stakes decision. We will pretty much always run a
tenant in its preferred AZ, and that AZ is fixed for the lifetime of the
tenant (unless a human intervenes).

Eventually, when we do auto-balancing based on utilization, I anticipate
that part of that will be to automatically change the AZ of tenants if
our original scheduling decisions have caused imbalance, but as an
interim measure, we can at least avoid making this scheduling decision
based purely on which AZ contains the emptiest node.

This is a precursor to #9947

## Summary of changes

- When creating a tenant, instead of scheduling a shard and then reading
its preferred AZ back, make the AZ decision first.
- Instead of choosing AZ based on which node is emptiest, use the median
utilization of nodes in each AZ to pick the AZ to use. This avoids bad
AZ decisions during periods when some node has very low utilization
(such as after replacing a dead node)

I considered also making the selection a weighted pseudo-random choice
based on utilization, but wanted to avoid destabilising tests with that
for now.
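A rough sketch of the median-based AZ choice described above, with made-up names and a plain integer standing in for the node utilization score (the real storage controller types differ):

```rust
// Sketch only: pick the AZ whose nodes have the lowest *median* utilization,
// rather than the AZ containing the single emptiest node. Using the median
// avoids choosing an AZ just because one node in it (e.g. a freshly replaced
// dead node) happens to be nearly empty.

use std::collections::HashMap;

/// `nodes` is a list of (az_name, utilization) per pageserver node.
fn pick_az_by_median_utilization(nodes: &[(String, u64)]) -> Option<String> {
    // Group per-node utilization scores by AZ.
    let mut by_az: HashMap<String, Vec<u64>> = HashMap::new();
    for (az, util) in nodes {
        by_az.entry(az.clone()).or_default().push(*util);
    }

    by_az
        .into_iter()
        .map(|(az, mut utils)| {
            utils.sort_unstable();
            // Median (upper median for even-sized groups) of this AZ's nodes.
            let median = utils[utils.len() / 2];
            (az, median)
        })
        .min_by_key(|(_, median)| *median)
        .map(|(az, _)| az)
}
```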
github-merge-queue bot pushed a commit that referenced this pull request Dec 17, 2024
## Problem

It is unreliable for the control plane to infer the AZ for computes from
where the tenant is currently attached, because if a tenant happens to
be in a degraded state or a release is ongoing while a compute starts,
then the tenant's attached AZ can be a different one to where it will
run long-term, and the control plane doesn't check back later to restart
the compute.

This can land in parallel with
#9947

## Summary of changes

- Thread through the preferred AZ into the compute hook code via the
reconciler
- Include the preferred AZ in the body of compute hook notifications
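As a purely illustrative sketch of the second bullet, the notification body might carry the preferred AZ along the lines below. The struct and field names are guesses, not the storage controller's actual compute hook schema; in practice the body would be serialized (e.g. to JSON) before being sent to the control plane.

```rust
// Illustrative only: what "include the preferred AZ in the body of compute
// hook notifications" could look like. Names are invented for this sketch.

struct ComputeHookNotification {
    tenant_id: String,
    // Per-shard attachment details elided for brevity.
    /// The AZ the tenant is expected to run in long-term, so the control plane
    /// can place computes without inferring the AZ from the current attachment.
    preferred_az: Option<String>,
}
```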
@jcsp (Collaborator, Author) commented Dec 19, 2024

This will land as part of #9916

@jcsp closed this Dec 19, 2024