-
Notifications
You must be signed in to change notification settings - Fork 463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
control_plane/attachment_service: improve Scheduler #6633
Conversation
2442 tests run: 2324 passed, 0 failed, 118 skipped (full report)Flaky tests (2)Postgres 15
Postgres 14
Code coverage (full report)
The comment gets automatically updated with the latest test results
e68a8d1 at 2024-02-19T11:35:12.366Z :recycle: |
b97ff44
to
cd73a97
Compare
cd73a97
to
0b51a07
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The transition to get_attached()
could have been a preliminary PR, would have reduced the noise quite a bit.
What are your plans for debuggability of the reference counting?
because I'm not at all confident that
- I spotted any possible bugs and
- We won't have refcounting errors in the future
What alternatives to manual refcounting exist & why were they not viable?
I'll have to spend more time reviewing the correctness of the reference counting once I know we're definitely going to take that route versus a TBD alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a call about this, there will be a follow up that adds self consistency check.
Also, we currently never remove nodes from the hash map, so, worst case right now is bad scheduling decisions that can be fixed by restarting . Future bad scheduling decisions can be prevented by restarting the service, which straightens the reference counts for a moment.
Problem
One of the major shortcuts in the initial version of this code was to construct a fresh
Scheduler
each time we need it, which is an O(N^2) cost as the tenant count increases.Summary of changes
Scheduler
alive through the lifetime of ServiceStateIntentState
as a reference tracking helper, updating Scheduler refcounts as nodes are added/removed from the intent.There is an automated test that checks things don't get pathologically slow with thousands of shards, but it's not included in this PR because tests that implicitly test the runner node performance take some thought to stabilize/land in CI.
Checklist before requesting a review
Checklist before merging