control_plane/attachment_service: improve Scheduler #6633

jcsp · 2024-02-05T14:12:58Z

Problem

One of the major shortcuts in the initial version of this code was to construct a fresh Scheduler each time we need it, which is an O(N^2) cost as the tenant count increases.

Summary of changes

Keep Scheduler alive through the lifetime of ServiceState
Use IntentState as a reference tracking helper, updating Scheduler refcounts as nodes are added/removed from the intent.
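
To illustrate the idea, here is a minimal sketch of a long-lived scheduler whose per-node reference counts are kept in step by the intent type. It is not the code from this PR: `NodeId`, `add_node`, `inc`, `dec`, `refcount`, `least_loaded`, `set_attached`, `push_secondary` and `clear` are hypothetical names chosen for the example, and the fields of `IntentState` are assumed.

```rust
use std::collections::HashMap;

/// Hypothetical node identifier (an assumption for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct NodeId(pub u64);

/// Long-lived scheduler: one reference count per known node.
#[derive(Default)]
pub struct Scheduler {
    refcounts: HashMap<NodeId, usize>,
}

impl Scheduler {
    /// Register a node with a zero count if it is not known yet.
    pub fn add_node(&mut self, node: NodeId) {
        self.refcounts.entry(node).or_insert(0);
    }

    fn inc(&mut self, node: NodeId) {
        *self.refcounts.entry(node).or_insert(0) += 1;
    }

    fn dec(&mut self, node: NodeId) {
        if let Some(count) = self.refcounts.get_mut(&node) {
            *count = count.saturating_sub(1);
        }
    }

    pub fn refcount(&self, node: NodeId) -> usize {
        self.refcounts.get(&node).copied().unwrap_or(0)
    }

    /// Pick the least-referenced node (ties broken by id); O(number of nodes).
    pub fn least_loaded(&self) -> Option<NodeId> {
        self.refcounts
            .iter()
            .min_by_key(|(id, count)| (**count, id.0))
            .map(|(id, _)| *id)
    }
}

/// Intent for one shard. Every mutation goes through a method that also
/// updates the scheduler, so the counts never need to be rebuilt from scratch.
#[derive(Default)]
pub struct IntentState {
    attached: Option<NodeId>,
    secondary: Vec<NodeId>,
}

impl IntentState {
    pub fn get_attached(&self) -> Option<NodeId> {
        self.attached
    }

    pub fn set_attached(&mut self, scheduler: &mut Scheduler, new: Option<NodeId>) {
        if self.attached == new {
            return;
        }
        if let Some(old) = self.attached.take() {
            scheduler.dec(old);
        }
        if let Some(node) = new {
            scheduler.inc(node);
        }
        self.attached = new;
    }

    pub fn push_secondary(&mut self, scheduler: &mut Scheduler, node: NodeId) {
        scheduler.inc(node);
        self.secondary.push(node);
    }

    /// Drop all references, e.g. before the intent is discarded.
    pub fn clear(&mut self, scheduler: &mut Scheduler) {
        if let Some(old) = self.attached.take() {
            scheduler.dec(old);
        }
        for node in self.secondary.drain(..) {
            scheduler.dec(node);
        }
    }
}
```

With this shape, a scheduling decision costs one pass over the nodes instead of one pass over every tenant's intent. The price is that any code path that changes an intent without going through these methods will skew the counts.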

There is an automated test that checks things don't get pathologically slow with thousands of shards, but it's not included in this PR because tests that implicitly test the runner node performance take some thought to stabilize/land in CI.

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2024-02-05T15:40:56Z

2442 tests run: 2324 passed, 0 failed, 118 skipped (full report)

Flaky tests (2)

Postgres 15

test_sharding_split_smoke: release

Postgres 14

test_sharding_split_smoke: release

Code coverage (full report)

functions: 55.8% (12941 of 23187 functions)
lines: 82.5% (70067 of 84925 lines)

The comment gets automatically updated with the latest test results: e68a8d1 at 2024-02-19T11:35:12.366Z

control_plane/attachment_service/src/tenant_state.rs

problame

The transition to get_attached() could have been a preliminary PR; that would have reduced the noise quite a bit.

What are your plans for debuggability of the reference counting? I ask because I'm not at all confident that

I spotted every possible bug, and
we won't have refcounting errors in the future.

What alternatives to manual refcounting exist & why were they not viable?

I'll have to spend more time reviewing the correctness of the reference counting once I know we're definitely going to take that route versus a TBD alternative.

control_plane/attachment_service/src/scheduler.rs

control_plane/attachment_service/src/service.rs

problame

Had a call about this; there will be a follow-up that adds a self-consistency check.

Also, we currently never remove nodes from the hash map, so the worst case right now is bad scheduling decisions ~~that can be fixed by restarting~~. Future bad scheduling decisions can be prevented by restarting the service, which straightens out the reference counts for a moment.
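
As a rough idea of what such a self-consistency check could look like (a sketch only; the follow-up may take a different shape), one can recompute the counts from all intents and compare them with the live ones. `Intent`, `check_refcounts` and the use of plain `u64` node ids are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical per-shard intent: an optional attached node plus secondaries.
pub struct Intent {
    pub attached: Option<u64>,
    pub secondary: Vec<u64>,
}

/// Recompute reference counts from scratch and compare with the live counts.
/// Returns (node id, live count, expected count) per mismatch, sorted by node id.
pub fn check_refcounts(
    live: &HashMap<u64, usize>,
    intents: &[Intent],
) -> Vec<(u64, usize, usize)> {
    // Expected counts: one reference per attached or secondary location.
    let mut expected: HashMap<u64, usize> = HashMap::new();
    for intent in intents {
        for node in intent.attached.iter().chain(intent.secondary.iter()) {
            *expected.entry(*node).or_insert(0) += 1;
        }
    }

    let mut mismatches = Vec::new();

    // Nodes the scheduler tracks: the live count must equal the expected one.
    for (node, count) in live {
        let want = expected.remove(node).unwrap_or(0);
        if *count != want {
            mismatches.push((*node, *count, want));
        }
    }

    // Nodes referenced by an intent but unknown to the scheduler.
    for (node, want) in expected {
        mismatches.push((node, 0, want));
    }

    mismatches.sort();
    mismatches
}
```

The check costs one pass over all intents, so it suits periodic background runs or debug assertions rather than the scheduling hot path.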

jcsp added the c/storage Component: storage label Feb 5, 2024

jcsp force-pushed the jcsp/improved-scheduler-mk2-cutdown branch from b97ff44 to cd73a97 Compare February 9, 2024 11:13

jcsp commented Feb 12, 2024

control_plane/attachment_service/src/tenant_state.rs

jcsp added 4 commits February 12, 2024 12:56

control_plane/attachment_service: better Scheduler

acc6715

control_plane: logging improvements

9966e70

tests: revise tests that used nonexistent node IDs

1ab6b81

attachment_service: a hack for compatibility tests

0b51a07

jcsp force-pushed the jcsp/improved-scheduler-mk2-cutdown branch from cd73a97 to 0b51a07 Compare February 12, 2024 12:56

jcsp marked this pull request as ready for review February 12, 2024 12:59

jcsp requested review from a team as code owners February 12, 2024 12:59

jcsp requested review from lubennikovaav and problame and removed request for a team and lubennikovaav February 12, 2024 12:59

update tenant_drop for scheduler changes

965180c

problame reviewed Feb 13, 2024

jcsp added 2 commits February 19, 2024 10:31

Merge remote-tracking branch 'upstream/main' into jcsp/improved-scheduler-mk2-cutdown

5ee830c

Refactor node_upsert

6205116

problame approved these changes Feb 19, 2024

jcsp added 3 commits February 19, 2024 10:41

refactor Scheduler::new

ba81909

Refactor inc/dec

595fb09

refactor process_results

e68a8d1

jcsp enabled auto-merge (squash) February 19, 2024 10:54

jcsp merged commit 7e42809 into main Feb 19, 2024
49 checks passed

jcsp deleted the jcsp/improved-scheduler-mk2-cutdown branch February 19, 2024 14:12
