# storcon: rework scheduler optimisation, prioritize AZ #9916
Conversation
CI report: 7245 tests run, 6936 passed, 2 failed, 307 skipped. The failures were on Postgres 16; 3 flaky tests were seen on Postgres 17 and Postgres 14. Code coverage was collected from Rust tests only. The comment is automatically updated with the latest test results (3ee0050 at 2024-12-19T17:19:43.048Z).
## Problem

When we update our scheduler/optimization code to respect AZs properly (#9916), the choice of AZ becomes a much higher-stakes decision. We will pretty much always run a tenant in its preferred AZ, and that AZ is fixed for the lifetime of the tenant (unless a human intervenes).

Eventually, when we do auto-balancing based on utilization, I anticipate that part of that will be to automatically change the AZ of tenants if our original scheduling decisions have caused imbalance. As an interim measure, we can at least avoid making this scheduling decision based purely on which AZ contains the emptiest node.

This is a precursor to #9947.

## Summary of changes

- When creating a tenant, instead of scheduling a shard and then reading its preferred AZ back, make the AZ decision first.
- Instead of choosing the AZ based on which node is emptiest, use the median utilization of nodes in each AZ to pick the AZ to use. This avoids bad AZ decisions during periods when some node has very low utilization (such as after replacing a dead node).

I considered also making the selection a weighted pseudo-random choice based on utilization, but wanted to avoid destabilising tests with that for now.
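A minimal sketch of the median-utilization AZ choice described above; the `NodeInfo` type, its fields, and the `pick_preferred_az` helper are illustrative stand-ins, not the actual storage controller code:

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a pageserver node; the real storcon types differ.
struct NodeInfo {
    az: String,
    utilization: u64,
}

/// Pick a preferred AZ for a new tenant by comparing the *median* node
/// utilization per AZ, rather than the single emptiest node. This keeps the
/// choice robust against one freshly-added, near-empty node skewing it.
fn pick_preferred_az(nodes: &[NodeInfo]) -> Option<String> {
    let mut by_az: HashMap<&str, Vec<u64>> = HashMap::new();
    for node in nodes {
        by_az.entry(node.az.as_str()).or_default().push(node.utilization);
    }

    by_az
        .into_iter()
        .map(|(az, mut utils)| {
            utils.sort_unstable();
            // Median of the AZ's node utilizations (lower middle value for
            // even-length lists -- fine for a tie-break heuristic).
            let median = utils[(utils.len() - 1) / 2];
            (az.to_string(), median)
        })
        .min_by_key(|(_, median)| *median)
        .map(|(az, _)| az)
}
```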
Neat! I like how the two-step migrations are expressed.
I'm a bit concerned by the amount of remote storage hydration this could end up causing on the pageservers ($$$). Do you have thoughts on that?
```rust
/// If we return true, it only means that optimization _might_ be possible, not that it necessarily is. If we
/// return no, it definitely means that calling [`Self::optimize_attachment`] or [`Self::optimize_secondary`] would do no
/// work.
pub(crate) fn maybe_optimizable(
```
nit: This is a nice idea. However, we'll have to keep it in sync when adding or changing any optimisations. How about attaching an `is_possible` closure to each optimisation type and calling them in a loop here?
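A rough sketch of this suggestion, using a simplified stand-in enum and a per-variant method in place of a literal closure; the variant names and the `is_possible` signature are hypothetical:

```rust
/// Simplified stand-in for the optimisation actions; the real enum carries
/// node IDs and other payload.
enum OptimizationAction {
    MigrateAttachment,
    ReplaceSecondary,
    RemoveSecondary,
}

impl OptimizationAction {
    /// Cheap per-variant precondition check, so `maybe_optimizable` stays in
    /// sync with the set of optimisations automatically.
    fn is_possible(&self, attached_count: usize, secondary_count: usize) -> bool {
        match self {
            OptimizationAction::MigrateAttachment => attached_count > 0 && secondary_count > 0,
            OptimizationAction::ReplaceSecondary => secondary_count > 0,
            OptimizationAction::RemoveSecondary => secondary_count > 1,
        }
    }
}

/// "Maybe optimizable" then means: some action's precondition holds.
fn maybe_optimizable(attached_count: usize, secondary_count: usize) -> bool {
    [
        OptimizationAction::MigrateAttachment,
        OptimizationAction::ReplaceSecondary,
        OptimizationAction::RemoveSecondary,
    ]
    .iter()
    .any(|action| action.is_possible(attached_count, secondary_count))
}
```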
That would be neat, but the checks aren't specific to the `ScheduleOptimizationAction` variants -- the check for affinity scores is relevant to all but `RemoveSecondary` actions (they're the steps in moving something around).

I had some `debug_assert!` checks in `optimize_all_plan` that checked that when `maybe_optimizable` says there's no work, there's really no work -- I've just extended those to run in release-mode builds too (under the `testing` feature), so we should get pretty good confidence that this function agrees with the optimization functions.
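For illustration, the release-mode gating described above could look roughly like this; the function and its arguments are hypothetical, only the `testing` feature name comes from the discussion:

```rust
/// Run the consistency check in debug builds, and in release builds when the
/// `testing` cargo feature is enabled.
#[cfg(any(debug_assertions, feature = "testing"))]
fn assert_optimizer_agreement(maybe_optimizable: bool, planned_work: bool) {
    // If maybe_optimizable() claimed there is no possible work, the full
    // planning pass must not have produced any either.
    if !maybe_optimizable {
        assert!(
            !planned_work,
            "maybe_optimizable() disagreed with the optimization plan"
        );
    }
}
```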
```rust
if self.preferred_az_id.is_some()
    && scheduler.get_node_az(&replacement) != self.preferred_az_id
{
```
nit: You could assert on this instead. `az_match` is the first member of `NodeAttachmentSchedulingScore`, so `find_better_location` should never return a location that's worse from the AZ PoV.
> `find_better_location` should never return a location that's worse from the AZ PoV.
That's true, but it might return a location that's equally bad as our current location, if we're already outside the preferred AZ. What I'm saying with this condition is "if there's a location that's better, but still not in my preferred AZ, then don't spend the resources on migrating: hold out until there is a location in the preferred AZ".
I've rewritten the comment to make that clearer.
```rust
Some(az) if az == &node_az => {
    // This shard's home AZ is equal to the node we're filling: it is
    // eligible to be moved; fall through.
}
```
Should we put shards with an explicit match to the front of the list? Otherwise, if the cluster is unbalanced, we'll stop the fill after getting full of shards with no preferred AZ and optimizations will have to do the lifting.
I'm thinking of shards with a None preferred_az as kind of a transitional state of affairs -- we'll have this problem for a little while but it'll go away once we fill out all the preferred_azs. If we have any long term desire for "floating" tenants then we can re-think.
To cope gracefully, we'd probably need to do more than just order the shards: because the shards are broken down into per-node lists, and we consume all the shards from the most loaded node first, we'd still end up consuming preferred_az=None shards from that most loaded node before preferred_az=Some shards from other nodes.
I think the neatest thing when rolling out will be to make sure most single-sharded tenants do have a preferred AZ set (if we set this using your script, it should end up matching their location and not result in lots of migration work with the new optimizer), while also clearing the preferred AZ for any large sharded tenants we're concerned about concentrating into a single AZ.
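For reference, the ordering the reviewer suggested could be sketched as below, with hypothetical types; as the reply notes, ordering alone would not be enough, because candidates are consumed from per-node lists:

```rust
/// Hypothetical fill candidate: a shard plus whether its preferred AZ matches
/// the node being filled.
struct FillCandidate {
    tenant_shard_id: String,
    matches_node_az: bool,
}

/// Put shards whose preferred AZ explicitly matches the node first, so a fill
/// that stops early still prioritises "home" shards over floating
/// (preferred_az = None) ones.
fn order_fill_candidates(candidates: &mut [FillCandidate]) {
    // `false < true`, so Reverse puts the explicit matches at the front.
    candidates.sort_by_key(|c| std::cmp::Reverse(c.matches_node_az));
}
```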
```
@@ -6795,9 +6833,15 @@ impl Service {
    fn fill_node_plan(&self, node_id: NodeId) -> Vec<TenantShardId> {
```
I went looking at how we choose the secondary for draining. It's via `Scheduler::node_preferred`. That function still has the assumption that there's only one secondary. We should update it to compute node scores, or at least choose the secondary with the home AZ.
Good catch. I think the right thing to do here is prefer the secondaries that aren't in the preferred AZ, to avoid scheduling into partly warmed up secondaries created for migrations. 3ee0050
It can now assert that shards get moved to their preferred AZs.

We prefer secondaries outside our preferred AZ, as these are not the temporary ones used in optimization, which are likely to have cold caches. Move it into `TenantShard` because its behavior is closely aligned with `TenantShard`'s logic in `optimize_attachment`.
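A minimal sketch of that preference, with illustrative types; the real logic lives in `TenantShard` alongside `optimize_attachment`:

```rust
/// Illustrative secondary location: node id plus the AZ the node is in.
struct SecondaryLocation {
    node_id: u64,
    az: String,
}

/// When picking a secondary to act on, prefer one *outside* the preferred AZ:
/// a secondary inside the preferred AZ is likely a temporary, partly warmed-up
/// location created by an ongoing optimization, and we don't want to disturb it.
fn pick_secondary<'a>(
    secondaries: &'a [SecondaryLocation],
    preferred_az: Option<&str>,
) -> Option<&'a SecondaryLocation> {
    secondaries
        .iter()
        .find(|s| Some(s.az.as_str()) != preferred_az)
        .or_else(|| secondaries.first())
}
```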
## Problem

We want to do a more robust job of scheduling tenants into their home AZ: #8264.

Closes: #8969

## Summary of changes
### Scope
This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one.
### Scheduling optimisation

`TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location, and then apply some refined logic for whether and how to migrate to it. A `for_optimization` method is used when comparing scores, so that we only do an optimisation if the scores differ by their highest-ranking attributes, not just because one pageserver is lower in utilization. Eventually we will want a mode that does this, but doing it here would make scheduling logic unstable and harder to test, and to do it correctly one needs to know the size of the tenant being migrated.
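As an illustration of the `for_optimization` idea, a sketch with simplified score fields (not the real `NodeAttachmentSchedulingScore` layout):

```rust
/// Simplified scheduling score. Ordering is lexicographic over the fields,
/// with AZ match ranked above affinity, and raw utilization last.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct Score {
    az_mismatch: bool, // false (matching AZ) sorts first
    affinity: u32,     // lower is better
    utilization: u64,  // lower is better
}

/// Truncated score used when deciding whether to *optimize*: drop the
/// utilization component so a migration is only proposed when a
/// higher-ranking attribute (AZ match, affinity) actually improves.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct OptimizationScore {
    az_mismatch: bool,
    affinity: u32,
}

impl Score {
    fn for_optimization(self) -> OptimizationScore {
        OptimizationScore {
            az_mismatch: self.az_mismatch,
            affinity: self.affinity,
        }
    }
}

fn should_migrate(current: Score, candidate: Score) -> bool {
    // Only migrate if the candidate wins on the truncated score, i.e. not
    // merely because its node has lower utilization right now.
    candidate.for_optimization() < current.for_optimization()
}
```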