# storage controller: tactical AZ-aware scheduling #8848
## Problem

In order to build AZ-aware scheduling, the storage controller needs to know which AZ each pageserver is in. Related: #8848

## Summary of changes

This patch set adds a new nullable column to the `nodes` table: `availability_zone_id`. The node registration request is extended to include the AZ id (pageservers already have this in their `metadata.json` file). If the node is already registered, we update the persistent and in-memory state with the provided AZ; otherwise, we add the node with the AZ to begin with (see the sketch below). A couple of assumptions are made here:

1. Pageserver AZ ids are stable.
2. AZ ids do not change over time.

Once all pageservers have a configured AZ, we can remove the optionals in the code and make the database column not nullable.
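To make the registration flow concrete, here is a minimal in-memory sketch. All names (`Node`, `RegisterRequest`, `register_node`) are illustrative assumptions rather than the actual storage controller types, and the persistence side is elided:

```rust
use std::collections::HashMap;

/// Illustrative in-memory node state; mirrors the new nullable
/// `availability_zone_id` column (hypothetical type, not the real one).
struct Node {
    id: u64,
    availability_zone_id: Option<String>,
}

/// Hypothetical registration request carrying the AZ id that the
/// pageserver reads from its `metadata.json`.
struct RegisterRequest {
    node_id: u64,
    availability_zone_id: Option<String>,
}

fn register_node(nodes: &mut HashMap<u64, Node>, req: RegisterRequest) {
    match nodes.get_mut(&req.node_id) {
        // Already registered: backfill the AZ, relying on the assumption
        // that a pageserver's AZ id never changes.
        Some(node) => {
            if req.availability_zone_id.is_some() {
                node.availability_zone_id = req.availability_zone_id;
            }
        }
        // New node: record the AZ from the start.
        None => {
            nodes.insert(
                req.node_id,
                Node {
                    id: req.node_id,
                    availability_zone_id: req.availability_zone_id,
                },
            );
        }
    }
}
```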
## Problem

The storage controller didn't previously consider AZ locality between compute and pageservers when scheduling. The control plane has this feature, and, since we are migrating tenants away from it, we need feature parity to avoid perf degradations.

## Summary of changes

The change itself is fairly simple:

1. Thread AZ info into the scheduler.
2. Add an extra member to the scheduling scores.

Step (2) deserves some more discussion (a sketch of the resulting score ordering follows below). Let's break it down by the shard type being scheduled:

**Attached Shards**

We wish for attached shards of a tenant to end up in the preferred AZ of the tenant, since that is where the compute is likely to be. The AZ member of `NodeAttachmentSchedulingScore` has been placed below the affinity score (so it has the second biggest weight when picking a node). The rationale for going below the affinity score is to avoid having all shards of a single tenant placed on the same node in 2-node regions, since that would mean that one tenant can drive the general workload of an entire pageserver. I'm not 100% sure this is the right decision, so I'm open to discussing hoisting the AZ up to first place.

**Secondary Shards**

We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one for HA purposes. The AZ member of `NodeSecondarySchedulingScore` has been placed first, so nodes in AZs other than the preferred one will always be considered first. On small clusters, this can mean that all the secondaries of a tenant are scheduled to the same pageserver, but secondaries don't use up as many resources as the attached location, so IMO the argument made for attached shards doesn't hold here.

Related: #8848
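Here is a minimal Rust sketch of how field order can encode these priorities. The field names and types are assumptions for illustration, not the actual `NodeAttachmentSchedulingScore`/`NodeSecondarySchedulingScore` definitions; the key idea is that deriving `Ord` compares fields in declaration order, so earlier fields carry more weight and the scheduler picks the node with the lowest score:

```rust
// Lower scores win; derived `Ord` compares fields top to bottom, so the
// declaration order below encodes scheduling priority. All names here are
// illustrative, not the real storage controller structs.

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct AttachmentScore {
    affinity: usize,   // highest weight: spread a tenant's shards across nodes
    az_mismatch: bool, // second: `false` (node in the preferred AZ) sorts first
    utilization: u64,  // tiebreaker: prefer the least-loaded node
}

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct SecondaryScore {
    az_match: bool,    // first: `false` (node *outside* the preferred AZ) sorts first
    affinity: usize,
    utilization: u64,
}

/// The scheduler would pick the candidate with the minimum score.
fn pick<S: Ord>(candidates: Vec<S>) -> Option<S> {
    candidates.into_iter().min()
}
```

Note that swapping `az_mismatch` above `affinity` in `AttachmentScore` would be exactly the "hoist the AZ up to first place" alternative discussed above.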
2024-10-07: It's a bit tricky to tell how well this went in ap-southeast-1 since we don't have pageservers in all AZs. I'll backfill AZs for the other regions today, then we can monitor for one more week.
2024-10-14: I slightly tweaked the queries to look only at computes created in the last 5 days and only at shard 0: with AZ match vs. without AZ match. Results look good. In us-east-2, 95% of computes created during the last 5 days are in the same AZ as the pageserver serving shard zero. Other regions look similar.
Lift the comparison above into an alert and close the issue.
We have an alert for this now: https://neonprod.grafana.net/alerting/grafana/ce0v1dz6x60aob/view
This is a subset of what's proposed in #8264. Here we focus on single-shard tenants in order to enable migration of cplane tenants to the storage controller.
External dependencies:
https://github.com/neondatabase/cloud/issues/15036
Once https://github.com/neondatabase/cloud/issues/15036 is done on the control plane side, the storage controller has implicit control over the ideal compute AZ. Conceptually, for each tenant shard, we should try to stay within the AZ of the initially scheduled pageserver (in practice, computes do not suspend much, so the initial placement tends to stick). A sketch of this idea follows below.
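As a toy illustration of "stay within the AZ of the initially scheduled pageserver" (all types and names here are hypothetical, not actual storage controller code):

```rust
use std::collections::HashMap;

// Hypothetical types for illustration only.
struct Node {
    availability_zone_id: Option<String>,
}

struct TenantShard {
    attached_node_id: u64,
}

/// A shard's preferred AZ is the AZ of the node it was initially scheduled
/// on; later migrations and compute placement should try to stay within it.
fn preferred_az(shard: &TenantShard, nodes: &HashMap<u64, Node>) -> Option<String> {
    nodes
        .get(&shard.attached_node_id)?
        .availability_zone_id
        .clone()
}
```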
Note that we are making a simplifying assumption here: the control plane respects the AZ of the pageserver assigned to the tenant shard. In practice this is not always true.
Implementation guide:
Assessing AZ locality:
This Grafana link has a couple of queries against the cplane database which tell us what fraction of computes land in the same AZ as the pageserver serving their shard zero.