
storcon: rework scheduler optimisation, prioritize AZ #9916

Draft · jcsp wants to merge 19 commits into main from jcsp/storcon-multi-secondaries
Conversation

@jcsp (Collaborator) commented Nov 27, 2024

Problem

We want to do a more robust job of scheduling tenants into their home AZ: #8264.

Closes: #8969

Summary of changes

TODO


Scope

This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one.

Scheduling optimisation

TenantShard's optimize_attachment and optimize_secondary methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for deciding whether, and how, to migrate to it:

  • To decide if a new location is sufficiently "better", we generate scores using some projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later.
  • Score types get a for_optimization method so that when we compare scores, we only do an optimisation if the scores differ in their highest-ranking attributes, not just because one pageserver has lower utilization (see the sketch after this list). Eventually we will want a mode that rebalances on utilization alone, but doing that here would make scheduling logic unstable and harder to test, and doing it correctly requires knowing the size of the tenant being migrated.
  • When we find a new attached location that we would like to move to, we will create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A, and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well.
  • A unit test is added for #8969 (storage controller: tenant creation sometimes disagrees with background optimization), which is implicitly fixed by having optimisation use the same scores as initial scheduling.
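
A minimal sketch of the score-comparison rule described above, using simplified stand-in types (the PR's real Score types and the exact for_optimization signature differ):

```rust
// Illustrative sketch only -- simplified stand-ins, not the PR's actual types.
// The idea: compare candidate and current locations with the same score the
// scheduler uses, but truncated so that only the high-ranking attributes
// (AZ match, affinity) can justify a migration; utilization alone cannot.
struct Score {
    az_mismatch: bool, // false (preferred AZ) sorts before true (other AZ)
    affinity: usize,   // co-located shards of the same tenant on this node
    utilization: u64,  // low-ranking tie-breaker, ignored for optimization
}

impl Score {
    fn for_optimization(&self) -> (bool, usize) {
        // Keep only the attributes that are allowed to trigger a migration.
        (self.az_mismatch, self.affinity)
    }
}

fn should_migrate(current: &Score, candidate: &Score) -> bool {
    // Migrate only if the candidate wins on the high-ranking attributes,
    // not merely because its node is less utilized.
    candidate.for_optimization() < current.for_optimization()
}
```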

@jcsp added the t/feature, run-benchmarks and c/storage/controller labels on Nov 27, 2024
@jcsp force-pushed the jcsp/storcon-multi-secondaries branch from 1eafcf2 to 03f09ae on December 2, 2024 08:59

github-actions bot commented Dec 2, 2024

7245 tests run: 6936 passed, 2 failed, 307 skipped (full report)


Failures on Postgres 16

  • test_sharding_autosplit[github-actions-selfhosted]: release-x86-64
  • test_download_churn[github-actions-selfhosted-100-tokio-epoll-uring-30]: release-x86-64
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_sharding_autosplit[release-pg16-github-actions-selfhosted] or test_download_churn[release-pg16-github-actions-selfhosted-100-tokio-epoll-uring-30]"
Flaky tests (3)

Postgres 17

Postgres 14

Code coverage* (full report)

  • functions: 31.3% (8430 of 26916 functions)
  • lines: 47.7% (66203 of 138664 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3ee0050 at 2024-12-19T17:19:43.048Z :recycle:

@jcsp force-pushed the jcsp/storcon-multi-secondaries branch 3 times, most recently from b64e73c to f943fa5 on December 5, 2024 17:37
@jcsp marked this pull request as ready for review on December 7, 2024 16:16
@jcsp requested a review from a team as a code owner on December 7, 2024 16:16
@jcsp requested a review from problame on December 7, 2024 16:16
@jcsp marked this pull request as draft on December 7, 2024 16:16
@jcsp (Collaborator, author) commented Dec 9, 2024

(Note to self: stashed jcsp/storcon-multi-secondaries-20241209, this passes all tests apart from a flake in test_sharding_autosplit where that test takes longer to reach a stable state & sometimes times out)

@jcsp removed the request for review from problame on December 9, 2024 10:57
@jcsp force-pushed the jcsp/storcon-multi-secondaries branch 2 times, most recently from d16b6c5 to fe2965c on December 10, 2024 12:24
github-merge-queue bot pushed a commit that referenced this pull request Dec 12, 2024
## Problem

When we update our scheduler/optimization code to respect AZs properly
(#9916), the choice of AZ
becomes a much higher-stakes decision. We will pretty much always run a
tenant in its preferred AZ, and that AZ is fixed for the lifetime of the
tenant (unless a human intervenes).

Eventually, when we do auto-balancing based on utilization, I anticipate
that part of that will be to automatically change the AZ of tenants if
our original scheduling decisions have caused imbalance, but as an
interim measure, we can at least avoid making this scheduling decision
based purely on which AZ contains the emptiest node.

This is a precursor to #9947

## Summary of changes

- When creating a tenant, instead of scheduling a shard and then reading
its preferred AZ back, make the AZ decision first.
- Instead of choosing AZ based on which node is emptiest, use the median
utilization of nodes in each AZ to pick the AZ to use. This avoids bad
AZ decisions during periods when some node has very low utilization
(such as after replacing a dead node)

I considered also making the selection a weighted pseudo-random choice
based on utilization, but wanted to avoid destabilising tests with that
for now.
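
As a rough illustration of that median-based choice (a sketch only; the function and the node/AZ representation below are hypothetical, not the merged code):

```rust
use std::collections::HashMap;

// Sketch only, not the merged code: choose the AZ whose nodes have the lowest
// median utilization, so a single unusually empty node (e.g. a freshly
// replaced one) cannot pull every new tenant into its AZ.
fn pick_preferred_az(utilization_by_az: &HashMap<String, Vec<u64>>) -> Option<String> {
    utilization_by_az
        .iter()
        .filter_map(|(az, utils)| {
            if utils.is_empty() {
                return None;
            }
            let mut sorted = utils.clone();
            sorted.sort_unstable();
            // Median utilization of this AZ's nodes.
            Some((az.clone(), sorted[sorted.len() / 2]))
        })
        .min_by_key(|(_, median)| *median)
        .map(|(az, _)| az)
}
```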
@VladLazar (Contributor) left a comment:

Neat! I like how the two-step migrations are expressed.


I'm a bit concerned by the amount of remote storage hydration this could end up causing on the pageservers ($$$). Do you have thoughts on that?

storage_controller/src/scheduler.rs (resolved review thread on an outdated diff)
/// If we return true, it only means that optimization _might_ be possible, not that it necessarily is. If we
/// return false, it definitely means that calling [`Self::optimize_attachment`] or [`Self::optimize_secondary`]
/// would do no work.
pub(crate) fn maybe_optimizable(
Contributor:

nit: This is a nice idea. However, we'll have to keep it in sync whenever we add or change optimisations. How about attaching an is_possible closure to each optimisation type and calling them in a loop here?
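
For illustration only, the suggestion might look roughly like this (hypothetical names, not code from the PR; see the author's reply below for why it wasn't adopted):

```rust
// Hypothetical sketch of the suggestion above, not code from the PR:
// each optimisation kind carries a cheap precondition, and maybe_optimizable()
// just asks whether any of them could apply.
struct OptimisationKind {
    name: &'static str,
    is_possible: fn() -> bool, // real code would take the shard and context
}

fn maybe_optimizable(kinds: &[OptimisationKind]) -> bool {
    kinds.iter().any(|kind| (kind.is_possible)())
}
```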

Collaborator (author):

That would be neat, but the checks aren't specific to the ScheduleOptimizationAction variants -- the check for affinity scores is relevant to all but RemoteSecondary actions (they're the steps in moving something around).

I had some debug_assert! checks in optimize_all_plan that checked that when maybe_optimizable says there's no work, there's really no work -- I've just extended those to also run in release-mode builds (under the testing feature), so we should get pretty good confidence that this function agrees with the optimization functions.
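
The cfg pattern being described is roughly the following (a sketch; the function name and arguments are hypothetical, not the PR's code):

```rust
// Sketch of the pattern described above -- names are hypothetical.
// The check runs in debug builds and also in release builds compiled with
// the crate's `testing` cargo feature.
#[cfg(any(debug_assertions, feature = "testing"))]
fn assert_optimizer_agreement(maybe_optimizable: bool, planned_optimizations: usize) {
    // If maybe_optimizable() reported no possible work, the optimization
    // functions must not have produced any.
    if !maybe_optimizable {
        assert_eq!(planned_optimizations, 0);
    }
}
```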

storage_controller/src/service.rs (resolved review thread on an outdated diff)
Comment on lines +960 to +963
if self.preferred_az_id.is_some()
&& scheduler.get_node_az(&replacement) != self.preferred_az_id
{
Contributor:
nit: You could assert on this instead. az_match is the first member of NodeAttachmentSchedulingScore, so find_better_location should never return a location that's worse from the AZ pov.

Collaborator (author):

> find_better_location should never return a location that's worse from the AZ pov.

That's true, but it might return a location that's equally bad as our current location, if we're already outside the preferred AZ. What I'm saying with this condition is "if there's a location that's better, but still not in my preferred AZ, then don't spend the resources on migrating: hold out until there is a location in the preferred AZ".

I've rewritten the comment to make that clearer.


Comment on lines +6865 to +6957
Some(az) if az == &node_az => {
    // This shard's home AZ is equal to the node we're filling: it is
    // eligible to be moved: fall through;
}
Contributor:

Should we put shards with an explicit AZ match at the front of the list? Otherwise, if the cluster is unbalanced, we'll stop the fill once the plan is full of shards with no preferred AZ, and optimizations will have to do the lifting.

Collaborator (author):

I'm thinking of shards with a None preferred_az as kind of a transitional state of affairs -- we'll have this problem for a little while but it'll go away once we fill out all the preferred_azs. If we have any long term desire for "floating" tenants then we can re-think.

To cope gracefully, we'd probably need to do more than just order the shards: because the shards are broken down into per-node lists, and we consume all the shards from the most loaded node first, we'd still end up consuming preferred_az=None shards from that most loaded node before preferred_az=Some shards from other nodes.

I think the neatest thing when rolling out will be to make sure most single-sharded tenants do have a preferred AZ set (if we set this using your script it should end up matching their location and not result in lots of migration work with the new optimizer), while also clearing the preferred AZ for any large sharded tenants we're concerned about concentrating into a single AZ.

@@ -6795,9 +6833,15 @@ impl Service {
fn fill_node_plan(&self, node_id: NodeId) -> Vec<TenantShardId> {
Contributor:

I went looking at how we choose the secondary for draining. It's via Scheduler::node_preferred. That function still assumes there's only one secondary. We should update it to compute node scores, or at least choose the secondary in the home AZ.

Collaborator (author):

Good catch. I think the right thing to do here is prefer the secondaries that aren't in the preferred AZ, to avoid scheduling into partly warmed-up secondaries created for migrations. (3ee0050)
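
A rough sketch of that preference (hypothetical helper, not commit 3ee0050 itself):

```rust
// Rough sketch, not the actual commit: when a shard has several secondaries,
// pick one outside its preferred AZ as the drain target, so we don't disturb
// a partly warmed-up secondary that was created for an in-AZ migration.
fn pick_drain_secondary(
    secondaries: &[(u64, Option<String>)], // (node id, node AZ)
    preferred_az: Option<&str>,
) -> Option<u64> {
    secondaries
        .iter()
        // `false` sorts before `true`, so secondaries whose AZ differs from
        // the preferred AZ are chosen first.
        .min_by_key(|(_, az)| az.as_deref() == preferred_az)
        .map(|(node_id, _)| *node_id)
}
```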

@jcsp force-pushed the jcsp/storcon-multi-secondaries branch from fe2965c to 3ee0050 on December 19, 2024 15:02
Labels
  • c/storage/controller (Component: Storage Controller)
  • run-benchmarks (indicates to the CI that benchmarks should be run for the PR marked with this label)
  • t/feature (Issue type: feature, for new features or requests)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage controller: tenant creation sometimes disagrees with background optimization
2 participants