Epic: storage controller: AZ-aware scheduling #8264

Open
4 of 11 tasks
VladLazar opened this issue Jul 4, 2024 · 3 comments
Assignees: jcsp
Labels: c/storage/controller (Component: Storage Controller), c/storage (Component: storage), t/feature (Issue type: feature, for new features or requests)

VladLazar (Contributor) commented Jul 4, 2024

The storage controller does not take pageserver availability zone into account when scheduling tenants.

Ideally, the following scheduling constraints would be followed (a rough sketch of how they could be scored follows the list):

  • Secondaries in a different AZ from their attached locations. This avoids downtime in the event that an entire AZ goes down.
  • Attached locations in an AZ that stays consistent over time (modulo restarts or failures), so that computes can be scheduled there. The control plane infers the AZ in which a compute should be scheduled from the compute notification it receives from the storage controller, so keeping the AZ consistent avoids cross-AZ traffic between compute and pageserver.
  • If there are enough pageservers in one AZ, try to put all the shards of a tenant in the same AZ, so that a compute in that AZ gets AZ-local performance (again, modulo restarts or failures).
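
A rough sketch of how these constraints could be folded into a single placement score, assuming hypothetical `AvailabilityZone` and `Candidate` types rather than the storage controller's real scheduler:

```
// Hypothetical sketch, not the real scheduler: lower score is better.
#[derive(Clone, PartialEq, Eq)]
struct AvailabilityZone(String);

struct Candidate {
    az: AvailabilityZone,
    utilization: u64, // stand-in for shard counts / disk usage on the node
}

fn placement_score(
    candidate: &Candidate,
    preferred_az: &AvailabilityZone,
    placing_attached: bool,
) -> u64 {
    // Attached locations want to be inside the preferred AZ (stable, compute-local);
    // secondaries want to be outside it, so an AZ outage cannot take out both.
    let az_violation = if placing_attached {
        candidate.az != *preferred_az
    } else {
        candidate.az == *preferred_az
    };

    // Penalize rather than forbid the violation, so scheduling still succeeds
    // when an AZ is down or a region only has pageservers in one AZ.
    let az_penalty: u64 = if az_violation { 1_000_000 } else { 0 };

    az_penalty + candidate.utilization
}
```

Treating the AZ preference as a heavy penalty rather than a hard constraint keeps failover possible when the preferred AZ is unavailable.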

VladLazar added the c/storage (Component: storage), c/storage/controller (Component: Storage Controller) and t/feature (Issue type: feature, for new features or requests) labels on Jul 4, 2024
jcsp changed the title from "storage controller: AZ-aware scheduling" to "Epic: storage controller: AZ-aware scheduling" on Sep 16, 2024
jcsp (Collaborator) commented Nov 22, 2024

Thinking a bit about this:

  • AZ affinity gives each shard a true "primary" and "secondary" location. We can use those in the scheduler instead of attached/secondary counts, which fluctuate depending on what's going on.
  • For optimization, we will need to handle shards having dual secondaries: when we want to smoothly relocate the primary, we will need to create another secondary in the home AZ, while also leaving a secondary in a remote AZ in case of failures (see the sketch after this list).
  • It probably makes sense to just always schedule all the shards of a sharded tenant in the home AZ, without trying to do any special casing for the mini-region case -- in such regions, if one AZ's pageservers become overloaded, that is simply the signal to expand the total pageserver count. This is preferable to degrading the performance of a large tenant by scattering its shards across AZs.
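
A hedged sketch of that dual-secondary relocation flow, using illustrative names (`NodeId`, `MigrationStep`, `plan_home_az_move`) rather than the controller's real reconciler types:

```
// Illustrative only: the real controller drives this via its reconciler.
type NodeId = u64;

enum MigrationStep {
    // Warm up an extra secondary on a node in the shard's home AZ, while the
    // existing remote-AZ secondary stays in place to cover failures.
    CreateHomeAzSecondary { target: NodeId },
    // Once the new secondary is warm, cut the attachment over to it.
    PromoteToAttached { target: NodeId },
    // Drop the old attached location, leaving one attachment in the home AZ
    // and exactly one secondary in a remote AZ.
    DetachOldPrimary { old: NodeId },
}

fn plan_home_az_move(old_attached: NodeId, home_az_target: NodeId) -> Vec<MigrationStep> {
    vec![
        MigrationStep::CreateHomeAzSecondary { target: home_az_target },
        MigrationStep::PromoteToAttached { target: home_az_target },
        MigrationStep::DetachOldPrimary { old: old_attached },
    ]
}
```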

VladLazar (Contributor, Author) commented:

> AZ affinity gives each shard a true "primary" and "secondary" location. We can use those in the scheduler instead of attached/secondary counts, which fluctuate depending on what's going on.

I don't really get what you mean here. AZ affinity gives you a preferred AZ. Do you mean that "primary = anywhere in the preferred AZ" and "secondary = anywhere not in the preferred AZ"?

> It probably makes sense to just always schedule all the shards of a sharded tenant in the home AZ, without trying to do any special casing for the mini-region case -- in such regions, if one AZ's pageservers become overloaded, that is simply the signal to expand the total pageserver count. This is preferable to degrading the performance of a large tenant by scattering its shards across AZs.

Yeah, I agree with this. The primary should stay in the preferred AZ at all times, apart from the unexpected failover case.

We should also think about the control plane and compute parts of the stack. Currently, cplane tries to schedule the compute in the same AZ as the `node_id` in the `tenant_shards` table. This works okay for new tenant shards, but not so well for existing ones. Since computes can be really long-lived, perhaps the preferred AZ should flow down from cplane (e.g. at start-up, storcon calls into cplane for the current compute AZ of each tenant shard).
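
A purely hypothetical sketch of that start-up flow; none of the types or methods below correspond to real storcon or cplane APIs:

```
// Hypothetical: the control plane as the source of truth for compute placement.
struct TenantShardRecord {
    tenant_id: String,
    preferred_az: Option<String>,
}

trait ComputePlacementSource {
    // Returns the AZ where the tenant's compute currently runs, if known.
    fn compute_az(&self, tenant_id: &str) -> Option<String>;
}

// On start-up, backfill preferred AZs for shards that don't have one yet,
// so long-lived computes keep AZ-local pageserver access.
fn seed_preferred_azs(cplane: &dyn ComputePlacementSource, shards: &mut [TenantShardRecord]) {
    for shard in shards.iter_mut() {
        if shard.preferred_az.is_none() {
            shard.preferred_az = cplane.compute_az(&shard.tenant_id);
        }
    }
}
```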

jcsp self-assigned this on Nov 25, 2024
jcsp (Collaborator) commented Nov 25, 2024

> I don't really get what you mean here. AZ affinity gives you a preferred AZ. Do you mean that "primary = anywhere in the preferred AZ" and "secondary = anywhere not in the preferred AZ"?

Right.

> Currently, cplane tries to schedule the compute in the same AZ as the `node_id` in the `tenant_shards` table. This works okay for new tenant shards, but not so well for existing ones.

Yes, let's track that separately in https://github.com/neondatabase/cloud/issues/17159 and use this ticket for the controller end of things.

github-merge-queue bot pushed a commit that referenced this issue Nov 29, 2024
…#9908)

## Problem

When picking locations for a shard, we should use a ScheduleContext that
includes all the other shards in the tenant, so that we apply proper
anti-affinity between shards. If we don't do this, then it can lead to
unstable scheduling, where we place a shard somewhere that the optimizer
will then immediately move it away from.

We didn't always do this, because it was a bit awkward to accumulate the
context for a tenant rather than just walking tenants.

This was a TODO in `handle_node_availability_transition`:
```
                        // TODO: populate a ScheduleContext including all shards in the same tenant_id (only matters
                        // for tenants without secondary locations: if they have a secondary location, then this
                        // schedule() call is just promoting an existing secondary)
```

This is a precursor to #8264,
where the current imperfect scheduling during node evacuation hampers
testing.

## Summary of changes

- Add an iterator type that yields each shard along with a ScheduleContext that includes all the other shards from the same tenant
- Use the iterator to replace hand-crafted logic in `optimize_all_plan` (functionally identical)
- Use the iterator in `handle_node_availability_transition` to apply proper anti-affinity during node evacuation.
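
A hedged sketch of that per-tenant context idea, using stand-in types (the real ScheduleContext and shard types in the controller differ):

```
// Stand-in types, not the controller's real ones.
use std::collections::BTreeMap;

type TenantId = u64;
type NodeId = u64;

#[derive(Clone)]
struct ShardState {
    shard_number: u8,
    attached_node: Option<NodeId>,
}

#[derive(Default)]
struct ScheduleContext {
    // Nodes already used by sibling shards of the same tenant; the scheduler
    // applies anti-affinity by avoiding (or penalizing) these nodes.
    nodes_in_use: Vec<NodeId>,
}

// Yields each shard together with a context built from its sibling shards,
// so scheduling one shard can see where the rest of the tenant lives.
fn shards_with_context(
    tenants: &BTreeMap<TenantId, Vec<ShardState>>,
) -> Vec<(TenantId, ShardState, ScheduleContext)> {
    let mut out = Vec::new();
    for (tenant_id, shards) in tenants {
        for (i, shard) in shards.iter().enumerate() {
            let mut ctx = ScheduleContext::default();
            for (j, sibling) in shards.iter().enumerate() {
                if i != j {
                    if let Some(node) = sibling.attached_node {
                        ctx.nodes_in_use.push(node);
                    }
                }
            }
            out.push((*tenant_id, shard.clone(), ctx));
        }
    }
    out
}
```

With a shared helper like this, node evacuation and plan optimization can both build the context the same way instead of relying on hand-crafted per-call logic.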
github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
## Problem

Sharded tenants should be run in a single AZ for best performance, so
that computes have AZ-local latency to all the shards.

Part of #8264

## Summary of changes

- When we split a tenant, instead of updating each shard's preferred AZ to wherever it is scheduled, propagate the preferred AZ from the parent.
- Drop the check in `test_shard_preferred_azs` that asserts shards end up in their preferred AZ: this will not be true again until the `optimize_attachment` logic is updated to make it so. The existing check wasn't testing anything about scheduling; it was just asserting that we set the preferred AZ in a way that matches how things happen to be scheduled at the time of the split.
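
A hedged sketch of the split-time propagation, with illustrative types rather than the controller's own:

```
// Illustrative only: real shard splitting involves much more state.
#[derive(Clone)]
struct AvailabilityZone(String);

#[derive(Clone)]
struct ShardSpec {
    shard_number: u8,
    shard_count: u8,
    preferred_az: Option<AvailabilityZone>,
}

// Children inherit the parent's preferred AZ instead of deriving it from
// wherever they first happen to be scheduled, keeping the whole tenant
// pinned to one AZ across splits.
fn split_children(parent: &ShardSpec, new_count: u8) -> Vec<ShardSpec> {
    (0..new_count)
        .map(|i| ShardSpec {
            shard_number: i,
            shard_count: new_count,
            preferred_az: parent.preferred_az.clone(),
        })
        .collect()
}
```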