control_plane/attachment_service: improve Scheduler #6633

Merged (10 commits) on Feb 19, 2024

Conversation

@jcsp (Collaborator) commented Feb 5, 2024

Problem

One of the major shortcuts in the initial version of this code was to construct a fresh Scheduler each time we need it, which is an O(N^2) cost as the tenant count increases.

Summary of changes

  • Keep Scheduler alive through the lifetime of ServiceState
  • Use IntentState as a reference tracking helper, updating Scheduler refcounts as nodes are added/removed from the intent.
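
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the names `Scheduler`, `IntentState` and `get_attached()` appear in the conversation, while `NodeId`, the field layout, and the methods `node_inc_ref`, `node_dec_ref`, `ref_count`, `least_loaded`, `set_attached`, `push_secondary` and `clear` are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical node identifier; the real type in the codebase may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct NodeId(pub u64);

/// Long-lived scheduler that keeps one reference count per node,
/// instead of being rebuilt from all tenants on every scheduling call.
#[derive(Debug, Default)]
pub struct Scheduler {
    shard_counts: HashMap<NodeId, usize>,
}

impl Scheduler {
    pub fn new(nodes: impl IntoIterator<Item = NodeId>) -> Self {
        Self { shard_counts: nodes.into_iter().map(|n| (n, 0)).collect() }
    }

    pub fn node_inc_ref(&mut self, node: NodeId) {
        *self.shard_counts.entry(node).or_insert(0) += 1;
    }

    pub fn node_dec_ref(&mut self, node: NodeId) {
        if let Some(count) = self.shard_counts.get_mut(&node) {
            // Saturate rather than underflow if the counts ever drift.
            *count = count.saturating_sub(1);
        }
    }

    pub fn ref_count(&self, node: NodeId) -> usize {
        self.shard_counts.get(&node).copied().unwrap_or(0)
    }

    /// Pick the node with the fewest references; ties go to the lowest id.
    pub fn least_loaded(&self) -> Option<NodeId> {
        self.shard_counts
            .iter()
            .min_by_key(|(id, count)| (**count, **id))
            .map(|(id, _)| *id)
    }
}

/// Per-shard intent. Every mutation goes through a method that also
/// updates the scheduler, so the counts follow the intent.
#[derive(Debug, Default)]
pub struct IntentState {
    attached: Option<NodeId>,
    secondary: Vec<NodeId>,
}

impl IntentState {
    pub fn get_attached(&self) -> Option<NodeId> {
        self.attached
    }

    pub fn set_attached(&mut self, scheduler: &mut Scheduler, new: Option<NodeId>) {
        if self.attached == new {
            return;
        }
        if let Some(old) = self.attached.take() {
            scheduler.node_dec_ref(old);
        }
        if let Some(node) = new {
            scheduler.node_inc_ref(node);
        }
        self.attached = new;
    }

    pub fn push_secondary(&mut self, scheduler: &mut Scheduler, node: NodeId) {
        scheduler.node_inc_ref(node);
        self.secondary.push(node);
    }

    /// Release every reference held by this intent.
    pub fn clear(&mut self, scheduler: &mut Scheduler) {
        if let Some(old) = self.attached.take() {
            scheduler.node_dec_ref(old);
        }
        for node in self.secondary.drain(..) {
            scheduler.node_dec_ref(node);
        }
    }
}
```

In this sketch the fields of `IntentState` are private, so callers cannot change the intent without passing the `Scheduler` along. That keeps the manual reference counting in one place, although nothing forces a caller to call `clear` before dropping an intent.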

There is an automated test that checks things don't get pathologically slow with thousands of shards, but it's not included in this PR because tests that implicitly test the runner node performance take some thought to stabilize/land in CI.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added the c/storage Component: storage label Feb 5, 2024

github-actions bot commented Feb 5, 2024

2442 tests run: 2324 passed, 0 failed, 118 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_sharding_split_smoke: release

Postgres 14

  • test_sharding_split_smoke: release

Code coverage (full report)

  • functions: 55.8% (12941 of 23187 functions)
  • lines: 82.5% (70067 of 84925 lines)

The comment gets automatically updated with the latest test results
e68a8d1 at 2024-02-19T11:35:12.366Z :recycle:

@jcsp jcsp force-pushed the jcsp/improved-scheduler-mk2-cutdown branch from b97ff44 to cd73a97 Compare February 9, 2024 11:13
@jcsp jcsp force-pushed the jcsp/improved-scheduler-mk2-cutdown branch from cd73a97 to 0b51a07 Compare February 12, 2024 12:56
@jcsp jcsp marked this pull request as ready for review February 12, 2024 12:59
@jcsp jcsp requested review from a team as code owners February 12, 2024 12:59
@jcsp jcsp requested review from lubennikovaav and problame and removed request for a team and lubennikovaav February 12, 2024 12:59
Contributor

@problame problame left a comment

The transition to get_attached() could have been a preliminary PR, would have reduced the noise quite a bit.


What are your plans for debuggability of the reference counting? I ask because I'm not at all confident that

  1. I spotted every possible bug, and
  2. we won't have refcounting errors in the future.

What alternatives to manual refcounting exist & why were they not viable?


I'll have to spend more time reviewing the correctness of the reference counting once I know we're definitely going to take that route versus a TBD alternative.

Resolved review threads (outdated):
  • control_plane/attachment_service/src/scheduler.rs (3 threads)
  • control_plane/attachment_service/src/service.rs (1 thread)
Contributor

@problame problame left a comment

We had a call about this; there will be a follow-up that adds a self-consistency check.

Also, we currently never remove nodes from the hash map, so the worst case right now is bad scheduling decisions, which can be fixed by restarting the service. Future bad scheduling decisions can likewise be prevented by restarting the service, which straightens out the reference counts for a moment.
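
One possible shape for such a self-consistency check, offered only as a sketch: recompute the per-node counts from the intents and compare them with what the scheduler is tracking. The follow-up itself is not part of this conversation, so every name here (`Intent`, `Mismatch`, `expected_counts`, `consistency_check`) and the use of a plain `u64` node id are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical node id type for this sketch.
pub type NodeId = u64;

/// Hypothetical, simplified view of one shard's intent.
pub struct Intent {
    pub attached: Option<NodeId>,
    pub secondary: Vec<NodeId>,
}

/// One node whose tracked count disagrees with the recomputed count.
#[derive(Debug, PartialEq, Eq)]
pub struct Mismatch {
    pub node: NodeId,
    pub tracked: usize,
    pub expected: usize,
}

/// Recompute reference counts from scratch by walking every intent.
pub fn expected_counts(intents: &[Intent]) -> HashMap<NodeId, usize> {
    let mut counts = HashMap::new();
    for intent in intents {
        for node in intent.attached.iter().chain(intent.secondary.iter()) {
            *counts.entry(*node).or_insert(0) += 1;
        }
    }
    counts
}

/// Compare the scheduler's tracked counts with the recomputed ones.
/// Returns mismatches sorted by node id; an empty vector means consistent.
pub fn consistency_check(
    tracked: &HashMap<NodeId, usize>,
    intents: &[Intent],
) -> Vec<Mismatch> {
    let expected = expected_counts(intents);

    // Union of node ids seen on either side, sorted and de-duplicated.
    let mut nodes: Vec<NodeId> =
        tracked.keys().chain(expected.keys()).copied().collect();
    nodes.sort_unstable();
    nodes.dedup();

    nodes
        .into_iter()
        .filter_map(|node| {
            let t = tracked.get(&node).copied().unwrap_or(0);
            let e = expected.get(&node).copied().unwrap_or(0);
            (t != e).then(|| Mismatch { node, tracked: t, expected: e })
        })
        .collect()
}
```

A check like this costs a full pass over all intents, so it would presumably run periodically or in tests rather than on every scheduling call. A node that is tracked with a count of zero and referenced by no intent is treated as consistent.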

@jcsp jcsp enabled auto-merge (squash) February 19, 2024 10:54
@jcsp jcsp merged commit 7e42809 into main Feb 19, 2024
49 checks passed
@jcsp jcsp deleted the jcsp/improved-scheduler-mk2-cutdown branch February 19, 2024 14:12
2 participants