# storage controller: tactical AZ-aware scheduling #8848
## Problem

In order to build AZ-aware scheduling, the storage controller needs to know which AZ each pageserver is in. Related: #8848

## Summary of changes

This patch set adds a new nullable column to the `nodes` table: `availability_zone_id`. The node registration request is extended to include the AZ id (pageservers already have this in their `metadata.json` file). If the node is already registered, we update the persistent and in-memory state with the provided AZ; otherwise, we add the node with the AZ to begin with (see the sketch below). A couple of assumptions are made here:

1. Pageserver AZ ids are stable.
2. AZ ids do not change over time.

Once all pageservers have a configured AZ, we can remove the optionals in the code and make the database column not nullable.
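To make the registration flow concrete, here is a minimal in-memory sketch. All names (`Node`, `RegisterRequest`, `register_node`) are illustrative assumptions rather than the actual storage controller types, and the persistence side is elided:

```rust
use std::collections::HashMap;

/// Illustrative in-memory node state; mirrors the new nullable
/// `availability_zone_id` column (hypothetical type, not the real one).
struct Node {
    id: u64,
    availability_zone_id: Option<String>,
}

/// Hypothetical registration request carrying the AZ id that the
/// pageserver reads from its `metadata.json`.
struct RegisterRequest {
    node_id: u64,
    availability_zone_id: Option<String>,
}

fn register_node(nodes: &mut HashMap<u64, Node>, req: RegisterRequest) {
    match nodes.get_mut(&req.node_id) {
        // Already registered: backfill the AZ, relying on the assumption
        // that a pageserver's AZ id never changes.
        Some(node) => {
            if req.availability_zone_id.is_some() {
                node.availability_zone_id = req.availability_zone_id;
            }
        }
        // New node: record the AZ from the start.
        None => {
            nodes.insert(
                req.node_id,
                Node {
                    id: req.node_id,
                    availability_zone_id: req.availability_zone_id,
                },
            );
        }
    }
}
```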
## Problem

The storage controller didn't previously consider AZ locality between compute and pageservers when scheduling. The control plane has this feature, and, since we are migrating tenants away from it, we need feature parity to avoid perf degradations.

## Summary of changes

The change itself is fairly simple:

1. Thread AZ info into the scheduler.
2. Add an extra member to the scheduling scores.

Step (2) deserves some more discussion (a sketch of the resulting score ordering follows below). Let's break it down by the shard type being scheduled:

**Attached Shards**

We wish for attached shards of a tenant to end up in the preferred AZ of the tenant, since that is where the compute is likely to be. The AZ member of `NodeAttachmentSchedulingScore` has been placed below the affinity score (so it has the second biggest weight when picking a node). The rationale for going below the affinity score is to avoid having all shards of a single tenant placed on the same node in 2-node regions, since that would mean that one tenant can drive the general workload of an entire pageserver. I'm not 100% sure this is the right decision, so I'm open to discussing hoisting the AZ up to first place.

**Secondary Shards**

We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one for HA purposes. The AZ member of `NodeSecondarySchedulingScore` has been placed first, so nodes in AZs other than the preferred one will always be considered first. On small clusters, this can mean that all the secondaries of a tenant are scheduled to the same pageserver, but secondaries don't use up as many resources as the attached location, so IMO the argument made for attached shards doesn't hold here.

Related: #8848
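Here is a minimal Rust sketch of how field order can encode these priorities. The field names and types are assumptions for illustration, not the actual `NodeAttachmentSchedulingScore`/`NodeSecondarySchedulingScore` definitions; the key idea is that deriving `Ord` compares fields in declaration order, so earlier fields carry more weight and the scheduler picks the node with the lowest score:

```rust
// Lower scores win; derived `Ord` compares fields top to bottom, so the
// declaration order below encodes scheduling priority. All names here are
// illustrative, not the real storage controller structs.

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct AttachmentScore {
    affinity: usize,   // highest weight: spread a tenant's shards across nodes
    az_mismatch: bool, // second: `false` (node in the preferred AZ) sorts first
    utilization: u64,  // tiebreaker: prefer the least-loaded node
}

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct SecondaryScore {
    az_match: bool,    // first: `false` (node *outside* the preferred AZ) sorts first
    affinity: usize,
    utilization: u64,
}

/// The scheduler would pick the candidate with the minimum score.
fn pick<S: Ord>(candidates: Vec<S>) -> Option<S> {
    candidates.into_iter().min()
}
```

Note that swapping `az_mismatch` above `affinity` in `AttachmentScore` would be exactly the "hoist the AZ up to first place" alternative discussed above.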
2024-10-07: It's a bit tricky to tell how well this went in ap-southeast-1 since we don't have pageservers in all AZs. I'll backfill AZs for the other regions today, then we can monitor for one more week.
2024-10-14: I slightly tweaked the queries to look only at computes created in the last 5 days and only at shard 0: with AZ match vs. without AZ match. Results look good. In us-east-2, 95% of computes created during the last 5 days are in the same AZ as the pageserver serving shard zero. Other regions look similar.
Lift the comparison above into an alert and close the issue.
We have an alert for this now: https://neonprod.grafana.net/alerting/grafana/ce0v1dz6x60aob/view
This is a subset of what's proposed in #8264. Here we focus on single-shard tenants in order to enable migration of cplane tenants to the storage controller.
External dependencies:
https://github.com/neondatabase/cloud/issues/15036
Once https://github.com/neondatabase/cloud/issues/15036 is done on the control plane side, the storage controller has implicit control over the ideal compute AZ. Conceptually, for each tenant shard, we should try to stay within the AZ of the initially scheduled pageserver (in practice, computes do not suspend much, so the initial placement tends to stick). A sketch of this idea follows below.
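As a toy illustration of "stay within the AZ of the initially scheduled pageserver" (all types and names here are hypothetical, not actual storage controller code):

```rust
use std::collections::HashMap;

// Hypothetical types for illustration only.
struct Node {
    availability_zone_id: Option<String>,
}

struct TenantShard {
    attached_node_id: u64,
}

/// A shard's preferred AZ is the AZ of the node it was initially scheduled
/// on; later migrations and compute placement should try to stay within it.
fn preferred_az(shard: &TenantShard, nodes: &HashMap<u64, Node>) -> Option<String> {
    nodes
        .get(&shard.attached_node_id)?
        .availability_zone_id
        .clone()
}
```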
Note that we are making a simplifying assumption here: the control plane respects the AZ of the pageserver assigned to the tenant shard. In practice this is not always true.
Implementation guide:
Assessing AZ locality:
This Grafana link has a couple of queries against the cplane database which tell us what fraction of computes land in the same AZ as the pageserver serving their shard zero.