
storage controller: tactical AZ-aware scheduling #8848

Closed
Tracked by #8264
VladLazar opened this issue Aug 27, 2024 · 6 comments
Assignees: VladLazar
Labels: c/storage/controller (Component: Storage Controller), c/storage (Component: storage), t/feature (Issue type: feature, for new features or requests)

Comments

@VladLazar
Contributor

This is a subset of what's proposed in #8264. Here we focus on single-shard tenants in order to enable the migration of cplane tenants to the storage controller.

External dependencies:
https://github.com/neondatabase/cloud/issues/15036

Once https://github.com/neondatabase/cloud/issues/15036 is done on the control plane side, the storage controller has implicit control over the ideal compute AZ. Conceptually, for each tenant shard, we should try to stay within the AZ of the pageserver it was initially scheduled on (in practice computes do not suspend much).

Note that we are making a simplifying assumption here: control plane respects the AZ of the pageserver assigned to the tenant shard. In practice this is not always true:

  • the compute pool for a given AZ might be full
  • the tenant is multi-sharded ...

Implementation guide:

  • Storcon keeps track of the AZ for each pageserver under its management (new db column). Open question: how do we backfill this?
  • Each tenant shard gets an associated home AZ (new db column). Conceptually, this is the AZ of the first pageserver it was scheduled on. For pre-existing tenant shards, it might make sense to backfill with the AZ that the storage controller is running in, since this is what the control plane used as input to its scheduling until https://github.com/neondatabase/cloud/issues/15036 is done.
  • Tweak the scheduling to have a soft preference for AZ preservation (a rough sketch of the shape this could take follows below).
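
To make the first two bullets concrete, here is a minimal sketch of the shape the new state could take on the controller side. All names (`NodeRecord`, `TenantShardRecord`, `preferred_az_id`, `set_home_az_if_unset`) are hypothetical and only illustrate the idea of nullable AZ columns plus a write-once home AZ; the actual storage controller types differ.

```rust
// Hypothetical persistence-layer records mirroring the two new nullable
// columns; names are illustrative, not the controller's actual types.
pub struct NodeRecord {
    pub node_id: i64,
    /// AZ reported by the pageserver (nullable until backfilled).
    pub availability_zone_id: Option<String>,
}

pub struct TenantShardRecord {
    pub tenant_id: String,
    pub shard_number: i32,
    /// Home AZ: the AZ of the first pageserver this shard was scheduled on.
    /// Nullable so pre-existing shards can be backfilled later.
    pub preferred_az_id: Option<String>,
}

/// Record the home AZ the first time the shard is scheduled; never overwrite it.
pub fn set_home_az_if_unset(shard: &mut TenantShardRecord, node: &NodeRecord) {
    if shard.preferred_az_id.is_none() {
        shard.preferred_az_id = node.availability_zone_id.clone();
    }
}
```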

Assessing az-locality:
This grafana link has a couple of queries against the cplane database which tell us:

  1. how many shards with an active compute are in one region
  2. how many of those shards are being served by a pageserver which matches the compute az
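
As a rough illustration of what those two numbers measure (the real queries are SQL against the cplane database and are not reproduced here), the comparison amounts to something like the following, assuming a hypothetical list of (compute AZ, pageserver AZ) pairs for shards with an active compute:

```rust
/// Given (compute AZ, pageserver AZ) pairs for shards with an active compute,
/// return (total shards, shards whose pageserver matches the compute AZ).
/// Purely illustrative; the actual metrics come from the Grafana queries.
fn az_locality(pairs: &[(&str, &str)]) -> (usize, usize) {
    let total = pairs.len();
    let matching = pairs
        .iter()
        .filter(|(compute_az, pageserver_az)| compute_az == pageserver_az)
        .count();
    (total, matching)
}
```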
@VladLazar VladLazar added c/storage Component: storage c/storage/controller Component: Storage Controller t/feature Issue type: feature, for new features or requests labels Aug 27, 2024
@VladLazar VladLazar self-assigned this Aug 27, 2024
VladLazar added a commit that referenced this issue Aug 28, 2024
## Problem
In order to build AZ-aware scheduling, the storage controller needs to know which AZ each pageserver is in.

Related #8848

## Summary of changes
This patch set adds a new nullable column to the `nodes` table: `availability_zone_id`. The node registration request is extended to include the AZ id (pageservers already have this in their `metadata.json` file).

If the node is already registered, then we update the persistent and in-memory state with the provided AZ. Otherwise, we add the node with the AZ to begin with.

A couple of assumptions are made here:
1. Pageserver AZ ids are stable (a pageserver does not move between AZs)
2. AZ ids themselves do not change over time

Once all pageservers have a configured AZ, we can remove the optionals in the code and make the database column not nullable.
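
A rough sketch of the registration flow described above, using hypothetical names (`RegisterRequest`, `Controller::register_node`); the real handler and its persistence calls differ:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical registration request; the AZ id comes from the pageserver's
/// metadata.json and stays optional while the rollout is in progress.
pub struct RegisterRequest {
    pub node_id: i64,
    pub availability_zone_id: Option<String>,
}

/// Illustrative in-memory view: node id -> AZ id (None until known/backfilled).
pub struct Controller {
    nodes: HashMap<i64, Option<String>>,
}

impl Controller {
    pub fn register_node(&mut self, req: RegisterRequest) {
        match self.nodes.entry(req.node_id) {
            // Node already registered: refresh the stored AZ with what it
            // reports (the persistent row would be updated the same way).
            Entry::Occupied(mut entry) => {
                if req.availability_zone_id.is_some() {
                    *entry.get_mut() = req.availability_zone_id;
                }
            }
            // New node: record it together with its AZ from the start.
            Entry::Vacant(entry) => {
                entry.insert(req.availability_zone_id);
            }
        }
    }
}
```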
VladLazar added a commit that referenced this issue Sep 25, 2024
## Problem

Storage controller didn't previously consider AZ locality between compute and pageservers when scheduling nodes. Control plane has this feature, and, since we are migrating tenants away from it, we need feature parity to avoid perf degradations.
## Summary of changes

The change itself is fairly simple:
1. Thread az info into the scheduler
2. Add an extra member to the scheduling scores

Step (2) deserves some more discussion. Let's break it down by the shard
type being scheduled:

**Attached Shards**

We wish for attached shards of a tenant to end up in the preferred AZ of the tenant since that is where the compute is likely to be.

The AZ member for `NodeAttachmentSchedulingScore` has been placed below the affinity score (so it's got the second biggest weight for picking the node). The rationale for going below the affinity score is to avoid having all shards of a single tenant placed on the same node in 2-node regions, since that would mean that one tenant can drive the general workload of an entire pageserver. I'm not 100% sure this is the right decision, so I'm open to discussing hoisting the AZ up to first place. A simplified sketch of the ordering follows below.
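
As a simplified sketch of that ordering (field names are illustrative, not the actual `NodeAttachmentSchedulingScore` definition): deriving `Ord` compares fields in declaration order, so the declaration order encodes the scheduling priority and candidates are ranked by taking the minimum score.

```rust
// Simplified sketch only; the real NodeAttachmentSchedulingScore has different
// fields. With #[derive(Ord)], fields are compared top to bottom, so the
// declaration order below encodes the scheduling priority.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct AttachmentScoreSketch {
    /// Shards of the same tenant already on this node (lower is better); kept
    /// first so one tenant cannot pile all of its shards onto a single node.
    affinity: usize,
    /// 0 if the node is in the tenant's preferred AZ, 1 otherwise; second
    /// place, so AZ locality breaks ties after affinity.
    az_mismatch: u8,
    /// Node utilization as the final tie-breaker in this sketch.
    utilization: u64,
}

// Node selection then just picks the candidate with the minimum score.
```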

 **Secondary Shards**

We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one for HA purposes.

The AZ member for `NodeSecondarySchedulingScore` has been placed first, so nodes in AZs different from the preferred one will always be considered first. On small clusters, this can mean that all the secondaries of a tenant are scheduled to the same pageserver, but secondaries don't use up as many resources as the attached location, so IMO the argument made for attached shards doesn't hold. The analogous sketch is below.
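
And the analogous sketch for secondaries, again with illustrative names only: the AZ component moves to the first field, so being outside the preferred AZ always wins and the other components only break ties.

```rust
// Simplified sketch only; not the actual NodeSecondarySchedulingScore.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct SecondaryScoreSketch {
    /// 0 if the node is OUTSIDE the tenant's preferred AZ (good for HA),
    /// 1 if it is inside; first place, so AZ separation dominates.
    in_preferred_az: u8,
    /// Shards of the same tenant already on this node (lower is better).
    affinity: usize,
    /// Node utilization as the final tie-breaker in this sketch.
    utilization: u64,
}
```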

Related: #8848
@VladLazar
Contributor Author

2024-09-25:

@VladLazar
Contributor Author

2024-09-30:

@VladLazar
Contributor Author

2024-10-07:

It's a bit tricky to tell how well this went in ap-southeast-1 since we don't have pageservers in all AZs.
In total there are 153 active computes in ap-southeast-1 at the time of writing, and 42 match the AZ of their pageserver.
However, only 66 active computes are in AZs where we have pageservers (so 42 of 66 possible matches), so I think it's actually working well.

I'll backfill AZs for the other regions today, then we can monitor for one more week.

@VladLazar
Contributor Author

2024-10-14

I slightly tweaked the queries to look only at computes created in the last 5 days and only at shard 0: with AZ match vs without AZ match.

Results look good. In us-east-2, 95% of computes created during the last 5 days are in the same AZ as the pageserver serving shard zero. Other regions look similar.

@VladLazar
Contributor Author

Lift the comparison above into an alert and close the issue.

@VladLazar
Contributor Author

We have an alert for this stuff now: https://neonprod.grafana.net/alerting/grafana/ce0v1dz6x60aob/view
