
storcon: rework scheduler optimisation, prioritize AZ #9916

Draft · jcsp wants to merge 19 commits into main from jcsp/storcon-multi-secondaries
Conversation

@jcsp (Collaborator) commented Nov 27, 2024

Problem

We want to do a more robust job of scheduling tenants into their home AZ: #8264.

Closes: #8969

Summary of changes

TODO


Scope

This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one.

Scheduling optimisation

TenantShard's optimize_attachment and optimize_secondary methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for deciding whether, and how, to migrate to it:

  • To decide if a new location is sufficiently "better", we generate scores using some projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later.
  • Score types get a for_optimization method so that when we compare scores, we only do an optimisation if the scores differ in their highest-ranking attributes, not just because one pageserver has lower utilization (see the sketch after this list). Eventually we will want a mode that rebalances on utilization alone, but doing that here would make scheduling logic unstable and harder to test, and doing it correctly requires knowing the size of the tenant being migrated.
  • When we find a new attached location that we would like to move to, we will create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A, and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well.
  • A unit test is added for #8969 (storage controller: tenant creation sometimes disagrees with background optimization), which is implicitly fixed by having optimisation use the same scores as initial scheduling.
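
A minimal sketch of the score-comparison rule described above, using simplified stand-in types (the PR's real Score types and the exact for_optimization signature differ):

```rust
// Illustrative sketch only -- simplified stand-ins, not the PR's actual types.
// The idea: compare candidate and current locations with the same score the
// scheduler uses, but truncated so that only the high-ranking attributes
// (AZ match, affinity) can justify a migration; utilization alone cannot.
struct Score {
    az_mismatch: bool, // false (preferred AZ) sorts before true (other AZ)
    affinity: usize,   // co-located shards of the same tenant on this node
    utilization: u64,  // low-ranking tie-breaker, ignored for optimization
}

impl Score {
    fn for_optimization(&self) -> (bool, usize) {
        // Keep only the attributes that are allowed to trigger a migration.
        (self.az_mismatch, self.affinity)
    }
}

fn should_migrate(current: &Score, candidate: &Score) -> bool {
    // Migrate only if the candidate wins on the high-ranking attributes,
    // not merely because its node is less utilized.
    candidate.for_optimization() < current.for_optimization()
}
```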

@jcsp added the t/feature, run-benchmarks and c/storage/controller labels on Nov 27, 2024
@jcsp force-pushed the jcsp/storcon-multi-secondaries branch from 1eafcf2 to 03f09ae on December 2, 2024 08:59

github-actions bot commented Dec 2, 2024

7245 tests run: 6936 passed, 2 failed, 307 skipped (full report)


Failures on Postgres 16

  • test_sharding_autosplit[github-actions-selfhosted]: release-x86-64
  • test_download_churn[github-actions-selfhosted-100-tokio-epoll-uring-30]: release-x86-64
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_sharding_autosplit[release-pg16-github-actions-selfhosted] or test_download_churn[release-pg16-github-actions-selfhosted-100-tokio-epoll-uring-30]"
Flaky tests (3)

Postgres 17

Postgres 14

Code coverage* (full report)

  • functions: 31.3% (8430 of 26916 functions)
  • lines: 47.7% (66203 of 138664 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3ee0050 at 2024-12-19T17:19:43.048Z :recycle:

@jcsp force-pushed the jcsp/storcon-multi-secondaries branch 3 times, most recently from b64e73c to f943fa5 on December 5, 2024 17:37
@jcsp marked this pull request as ready for review on December 7, 2024 16:16
@jcsp requested a review from a team as a code owner on December 7, 2024 16:16
@jcsp requested a review from problame on December 7, 2024 16:16
@jcsp marked this pull request as draft on December 7, 2024 16:16
@jcsp (Collaborator, author) commented Dec 9, 2024

(Note to self: stashed jcsp/storcon-multi-secondaries-20241209, this passes all tests apart from a flake in test_sharding_autosplit where that test takes longer to reach a stable state & sometimes times out)

@jcsp removed the request for review from problame on December 9, 2024 10:57
@jcsp force-pushed the jcsp/storcon-multi-secondaries branch 2 times, most recently from d16b6c5 to fe2965c on December 10, 2024 12:24
github-merge-queue bot pushed a commit that referenced this pull request Dec 12, 2024
## Problem

When we update our scheduler/optimization code to respect AZs properly
(#9916), the choice of AZ
becomes a much higher-stakes decision. We will pretty much always run a
tenant in its preferred AZ, and that AZ is fixed for the lifetime of the
tenant (unless a human intervenes).

Eventually, when we do auto-balancing based on utilization, I anticipate
that part of that will be to automatically change the AZ of tenants if
our original scheduling decisions have caused imbalance, but as an
interim measure, we can at least avoid making this scheduling decision
based purely on which AZ contains the emptiest node.

This is a precursor to #9947

## Summary of changes

- When creating a tenant, instead of scheduling a shard and then reading
its preferred AZ back, make the AZ decision first.
- Instead of choosing AZ based on which node is emptiest, use the median
utilization of nodes in each AZ to pick the AZ to use. This avoids bad
AZ decisions during periods when some node has very low utilization
(such as after replacing a dead node)

I considered also making the selection a weighted pseudo-random choice
based on utilization, but wanted to avoid destabilising tests with that
for now.
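
As a rough illustration of that median-based choice (a sketch only; the function and the node/AZ representation below are hypothetical, not the merged code):

```rust
use std::collections::HashMap;

// Sketch only, not the merged code: choose the AZ whose nodes have the lowest
// median utilization, so a single unusually empty node (e.g. a freshly
// replaced one) cannot pull every new tenant into its AZ.
fn pick_preferred_az(utilization_by_az: &HashMap<String, Vec<u64>>) -> Option<String> {
    utilization_by_az
        .iter()
        .filter_map(|(az, utils)| {
            if utils.is_empty() {
                return None;
            }
            let mut sorted = utils.clone();
            sorted.sort_unstable();
            // Median utilization of this AZ's nodes.
            Some((az.clone(), sorted[sorted.len() / 2]))
        })
        .min_by_key(|(_, median)| *median)
        .map(|(az, _)| az)
}
```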
@VladLazar (Contributor) left a comment:

Neat! I like how the two-step migrations are expressed.


I'm a bit concerned by the amount of remote storage hydration this could end up causing on the pageservers ($$$). Do you have thoughts on that?

storage_controller/src/scheduler.rs (resolved review thread on an outdated diff)
/// If we return true, it only means that optimization _might_ be possible, not that it necessarily is. If we
/// return false, it definitely means that calling [`Self::optimize_attachment`] or [`Self::optimize_secondary`]
/// would do no work.
pub(crate) fn maybe_optimizable(
Contributor:

nit: This is a nice idea. However, we'll have to keep it in sync whenever we add or change optimisations. How about attaching an is_possible closure to each optimisation type and calling them in a loop here?
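
For illustration only, the suggestion might look roughly like this (hypothetical names, not code from the PR; see the author's reply below for why it wasn't adopted):

```rust
// Hypothetical sketch of the suggestion above, not code from the PR:
// each optimisation kind carries a cheap precondition, and maybe_optimizable()
// just asks whether any of them could apply.
struct OptimisationKind {
    name: &'static str,
    is_possible: fn() -> bool, // real code would take the shard and context
}

fn maybe_optimizable(kinds: &[OptimisationKind]) -> bool {
    kinds.iter().any(|kind| (kind.is_possible)())
}
```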

Collaborator (author):

That would be neat, but the checks aren't specific to the ScheduleOptimizationAction variants -- the check for affinity scores is relevant to all but RemoteSecondary actions (they're the steps in moving something around).

I had some debug_assert! checks in optimize_all_plan that checked that when maybe_optimizable says there's no work, there's really no work -- I've just extended those to also run in release-mode builds (under the testing feature), so we should get pretty good confidence that this function agrees with the optimization functions.
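
The cfg pattern being described is roughly the following (a sketch; the function name and arguments are hypothetical, not the PR's code):

```rust
// Sketch of the pattern described above -- names are hypothetical.
// The check runs in debug builds and also in release builds compiled with
// the crate's `testing` cargo feature.
#[cfg(any(debug_assertions, feature = "testing"))]
fn assert_optimizer_agreement(maybe_optimizable: bool, planned_optimizations: usize) {
    // If maybe_optimizable() reported no possible work, the optimization
    // functions must not have produced any.
    if !maybe_optimizable {
        assert_eq!(planned_optimizations, 0);
    }
}
```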

storage_controller/src/service.rs (resolved review thread on an outdated diff)
Comment on lines +960 to +963
if self.preferred_az_id.is_some()
&& scheduler.get_node_az(&replacement) != self.preferred_az_id
{
Contributor:
nit: You could assert on this instead. az_match is the first member of NodeAttachmentSchedulingScore, so find_better_location should never return a location that's worse from the AZ pov.

Collaborator (author):

> find_better_location should never return a location that's worse from the AZ pov.

That's true, but it might return a location that's equally bad as our current location, if we're already outside the preferred AZ. What I'm saying with this condition is "if there's a location that's better, but still not in my preferred AZ, then don't spend the resources on migrating: hold out until there is a location in the preferred AZ".

I've rewritten the comment to make that clearer.


Comment on lines +6865 to +6957
Some(az) if az == &node_az => {
    // This shard's home AZ is equal to the node we're filling: it is
    // eligible to be moved: fall through;
}
Contributor:

Should we put shards with an explicit AZ match at the front of the list? Otherwise, if the cluster is unbalanced, we'll stop the fill once the plan is full of shards with no preferred AZ, and optimizations will have to do the lifting.

Collaborator (author):

I'm thinking of shards with a None preferred_az as kind of a transitional state of affairs -- we'll have this problem for a little while but it'll go away once we fill out all the preferred_azs. If we have any long term desire for "floating" tenants then we can re-think.

To cope gracefully, we'd probably need to do more than just order the shards: because the shards are broken down into per-node lists, and we consume all the shards from the most loaded node first, we'd still end up consuming preferred_az=None shards from that most loaded node before preferred_az=Some shards from other nodes.

I think the neatest thing when rolling out will be to make sure most single-sharded tenants do have a preferred AZ set (if we set this using your script it should end up matching their location and not result in lots of migration work with the new optimizer), while also clearing the preferred AZ for any large sharded tenants we're concerned about concentrating into a single AZ.

@@ -6795,9 +6833,15 @@ impl Service {
fn fill_node_plan(&self, node_id: NodeId) -> Vec<TenantShardId> {
Contributor:

I went looking at how we choose the secondary for draining. It's via Scheduler::node_preferred. That function still assumes there's only one secondary. We should update it to compute node scores, or at least choose the secondary in the home AZ.

Collaborator (author):

Good catch. I think the right thing to do here is prefer the secondaries that aren't in the preferred AZ, to avoid scheduling into partly warmed-up secondaries created for migrations. (3ee0050)
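
A rough sketch of that preference (hypothetical helper, not commit 3ee0050 itself):

```rust
// Rough sketch, not the actual commit: when a shard has several secondaries,
// pick one outside its preferred AZ as the drain target, so we don't disturb
// a partly warmed-up secondary that was created for an in-AZ migration.
fn pick_drain_secondary(
    secondaries: &[(u64, Option<String>)], // (node id, node AZ)
    preferred_az: Option<&str>,
) -> Option<u64> {
    secondaries
        .iter()
        // `false` sorts before `true`, so secondaries whose AZ differs from
        // the preferred AZ are chosen first.
        .min_by_key(|(_, az)| az.as_deref() == preferred_az)
        .map(|(node_id, _)| *node_id)
}
```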

@jcsp force-pushed the jcsp/storcon-multi-secondaries branch from fe2965c to 3ee0050 on December 19, 2024 15:02
Labels
  • c/storage/controller (Component: Storage Controller)
  • run-benchmarks (indicates to the CI that benchmarks should be run for the PR marked with this label)
  • t/feature (Issue type: feature, for new features or requests)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage controller: tenant creation sometimes disagrees with background optimization
2 participants