Epic: storage controller: AZ-aware scheduling #8264

Open
4 of 11 tasks
VladLazar opened this issue Jul 4, 2024 · 3 comments
Assignees: jcsp
Labels: c/storage/controller (Component: Storage Controller), c/storage (Component: storage), t/feature (Issue type: feature, for new features or requests)

VladLazar (Contributor) commented Jul 4, 2024

The storage controller does not take pageserver availability zone into account when scheduling tenants.

Ideally, the following scheduling constraints would be followed (a rough sketch of how they could be scored follows the list):

  • Secondaries in a different AZ from their attached locations. This avoids downtime in the event that an entire AZ goes down.
  • Attached locations in an AZ that stays consistent over time (modulo restarts or failures), so that computes can be scheduled there. The control plane infers the AZ in which a compute should be scheduled from the compute notification it receives from the storage controller, so keeping the AZ consistent avoids cross-AZ traffic between compute and pageserver.
  • If there are enough pageservers in one AZ, try to put all the shards of a tenant in the same AZ, so that a compute in that AZ gets AZ-local performance (again, modulo restarts or failures).
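
A rough sketch of how these constraints could be folded into a single placement score, assuming hypothetical `AvailabilityZone` and `Candidate` types rather than the storage controller's real scheduler:

```
// Hypothetical sketch, not the real scheduler: lower score is better.
#[derive(Clone, PartialEq, Eq)]
struct AvailabilityZone(String);

struct Candidate {
    az: AvailabilityZone,
    utilization: u64, // stand-in for shard counts / disk usage on the node
}

fn placement_score(
    candidate: &Candidate,
    preferred_az: &AvailabilityZone,
    placing_attached: bool,
) -> u64 {
    // Attached locations want to be inside the preferred AZ (stable, compute-local);
    // secondaries want to be outside it, so an AZ outage cannot take out both.
    let az_violation = if placing_attached {
        candidate.az != *preferred_az
    } else {
        candidate.az == *preferred_az
    };

    // Penalize rather than forbid the violation, so scheduling still succeeds
    // when an AZ is down or a region only has pageservers in one AZ.
    let az_penalty: u64 = if az_violation { 1_000_000 } else { 0 };

    az_penalty + candidate.utilization
}
```

Treating the AZ preference as a heavy penalty rather than a hard constraint keeps failover possible when the preferred AZ is unavailable.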

VladLazar added the c/storage (Component: storage), c/storage/controller (Component: Storage Controller) and t/feature (Issue type: feature, for new features or requests) labels on Jul 4, 2024
jcsp changed the title from "storage controller: AZ-aware scheduling" to "Epic: storage controller: AZ-aware scheduling" on Sep 16, 2024
jcsp (Collaborator) commented Nov 22, 2024

Thinking a bit about this:

  • AZ affinity gives each shard a true "primary" and "secondary" location. We can use those in the scheduler instead of attached/secondary counts, which fluctuate depending on what's going on.
  • For optimization, we will need to handle shards having dual secondaries: when we want to smoothly relocate the primary, we will need to create another secondary in the home AZ, while also leaving a secondary in a remote AZ in case of failures (see the sketch after this list).
  • It probably makes sense to just always schedule all the shards of a sharded tenant in the home AZ, without trying to do any special casing for the mini-region case -- in such regions, if one AZ's pageservers become overloaded, that is simply the signal to expand the total pageserver count. This is preferable to degrading the performance of a large tenant by scattering its shards across AZs.
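
A hedged sketch of that dual-secondary relocation flow, using illustrative names (`NodeId`, `MigrationStep`, `plan_home_az_move`) rather than the controller's real reconciler types:

```
// Illustrative only: the real controller drives this via its reconciler.
type NodeId = u64;

enum MigrationStep {
    // Warm up an extra secondary on a node in the shard's home AZ, while the
    // existing remote-AZ secondary stays in place to cover failures.
    CreateHomeAzSecondary { target: NodeId },
    // Once the new secondary is warm, cut the attachment over to it.
    PromoteToAttached { target: NodeId },
    // Drop the old attached location, leaving one attachment in the home AZ
    // and exactly one secondary in a remote AZ.
    DetachOldPrimary { old: NodeId },
}

fn plan_home_az_move(old_attached: NodeId, home_az_target: NodeId) -> Vec<MigrationStep> {
    vec![
        MigrationStep::CreateHomeAzSecondary { target: home_az_target },
        MigrationStep::PromoteToAttached { target: home_az_target },
        MigrationStep::DetachOldPrimary { old: old_attached },
    ]
}
```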

VladLazar (Contributor, Author) commented:

> AZ affinity gives each shard a true "primary" and "secondary" location. We can use those in the scheduler instead of attached/secondary counts, which fluctuate depending on what's going on.

I don't really get what you mean here. AZ affinity gives you a preferred AZ. Do you mean that "primary = anywhere in the preferred AZ" and "secondary = anywhere not in the preferred AZ"?

> It probably makes sense to just always schedule all the shards of a sharded tenant in the home AZ, without trying to do any special casing for the mini-region case -- in such regions, if one AZ's pageservers become overloaded, that is simply the signal to expand the total pageserver count. This is preferable to degrading the performance of a large tenant by scattering its shards across AZs.

Yeah, I agree with this. The primary should stay in the preferred AZ at all times, apart from the unexpected failover case.

We should also think about the control plane and compute parts of the stack. Currently, cplane tries to schedule the compute in the same AZ as the `node_id` in the `tenant_shards` table. This works okay for new tenant shards, but not so well for existing ones. Since computes can be really long-lived, perhaps the preferred AZ should flow down from cplane (e.g. at start-up, storcon calls into cplane for the current compute AZ of each tenant shard).
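
A purely hypothetical sketch of that start-up flow; none of the types or methods below correspond to real storcon or cplane APIs:

```
// Hypothetical: the control plane as the source of truth for compute placement.
struct TenantShardRecord {
    tenant_id: String,
    preferred_az: Option<String>,
}

trait ComputePlacementSource {
    // Returns the AZ where the tenant's compute currently runs, if known.
    fn compute_az(&self, tenant_id: &str) -> Option<String>;
}

// On start-up, backfill preferred AZs for shards that don't have one yet,
// so long-lived computes keep AZ-local pageserver access.
fn seed_preferred_azs(cplane: &dyn ComputePlacementSource, shards: &mut [TenantShardRecord]) {
    for shard in shards.iter_mut() {
        if shard.preferred_az.is_none() {
            shard.preferred_az = cplane.compute_az(&shard.tenant_id);
        }
    }
}
```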

jcsp self-assigned this on Nov 25, 2024
jcsp (Collaborator) commented Nov 25, 2024

> I don't really get what you mean here. AZ affinity gives you a preferred AZ. Do you mean that "primary = anywhere in the preferred AZ" and "secondary = anywhere not in the preferred AZ"?

Right.

> Currently, cplane tries to schedule the compute in the same AZ as the `node_id` in the `tenant_shards` table. This works okay for new tenant shards, but not so well for existing ones.

Yes, let's track that separately in https://github.com/neondatabase/cloud/issues/17159 and use this ticket for the controller end of things.

github-merge-queue bot pushed a commit that referenced this issue Nov 29, 2024
…#9908)

## Problem

When picking locations for a shard, we should use a ScheduleContext that
includes all the other shards in the tenant, so that we apply proper
anti-affinity between shards. If we don't do this, then it can lead to
unstable scheduling, where we place a shard somewhere that the optimizer
will then immediately move it away from.

We didn't always do this, because it was a bit awkward to accumulate the
context for a tenant rather than just walking tenants.

This was a TODO in `handle_node_availability_transition`:
```
                        // TODO: populate a ScheduleContext including all shards in the same tenant_id (only matters
                        // for tenants without secondary locations: if they have a secondary location, then this
                        // schedule() call is just promoting an existing secondary)
```

This is a precursor to #8264,
where the current imperfect scheduling during node evacuation hampers
testing.

## Summary of changes

- Add an iterator type that yields each shard along with a ScheduleContext that includes all the other shards from the same tenant
- Use the iterator to replace hand-crafted logic in `optimize_all_plan` (functionally identical)
- Use the iterator in `handle_node_availability_transition` to apply proper anti-affinity during node evacuation.
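
A hedged sketch of that per-tenant context idea, using stand-in types (the real ScheduleContext and shard types in the controller differ):

```
// Stand-in types, not the controller's real ones.
use std::collections::BTreeMap;

type TenantId = u64;
type NodeId = u64;

#[derive(Clone)]
struct ShardState {
    shard_number: u8,
    attached_node: Option<NodeId>,
}

#[derive(Default)]
struct ScheduleContext {
    // Nodes already used by sibling shards of the same tenant; the scheduler
    // applies anti-affinity by avoiding (or penalizing) these nodes.
    nodes_in_use: Vec<NodeId>,
}

// Yields each shard together with a context built from its sibling shards,
// so scheduling one shard can see where the rest of the tenant lives.
fn shards_with_context(
    tenants: &BTreeMap<TenantId, Vec<ShardState>>,
) -> Vec<(TenantId, ShardState, ScheduleContext)> {
    let mut out = Vec::new();
    for (tenant_id, shards) in tenants {
        for (i, shard) in shards.iter().enumerate() {
            let mut ctx = ScheduleContext::default();
            for (j, sibling) in shards.iter().enumerate() {
                if i != j {
                    if let Some(node) = sibling.attached_node {
                        ctx.nodes_in_use.push(node);
                    }
                }
            }
            out.push((*tenant_id, shard.clone(), ctx));
        }
    }
    out
}
```

With a shared helper like this, node evacuation and plan optimization can both build the context the same way instead of relying on hand-crafted per-call logic.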
github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
## Problem

Sharded tenants should be run in a single AZ for best performance, so
that computes have AZ-local latency to all the shards.

Part of #8264

## Summary of changes

- When we split a tenant, instead of updating each shard's preferred AZ to wherever it is scheduled, propagate the preferred AZ from the parent.
- Drop the check in `test_shard_preferred_azs` that asserts shards end up in their preferred AZ: this will not be true again until the `optimize_attachment` logic is updated to make it so. The existing check wasn't testing anything about scheduling; it was just asserting that we set the preferred AZ in a way that matches how things happen to be scheduled at the time of the split.
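
A hedged sketch of the split-time propagation, with illustrative types rather than the controller's own:

```
// Illustrative only: real shard splitting involves much more state.
#[derive(Clone)]
struct AvailabilityZone(String);

#[derive(Clone)]
struct ShardSpec {
    shard_number: u8,
    shard_count: u8,
    preferred_az: Option<AvailabilityZone>,
}

// Children inherit the parent's preferred AZ instead of deriving it from
// wherever they first happen to be scheduled, keeping the whole tenant
// pinned to one AZ across splits.
fn split_children(parent: &ShardSpec, new_count: u8) -> Vec<ShardSpec> {
    (0..new_count)
        .map(|i| ShardSpec {
            shard_number: i,
            shard_count: new_count,
            preferred_az: parent.preferred_az.clone(),
        })
        .collect()
}
```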