
vm-monitor: Refactor scaling logic into CgroupWatcher #5488

Closed
sharnoff wants to merge 1 commit

Conversation

sharnoff (Member) commented Oct 6, 2023

The general idea of this PR is to move the on-downscale and on-upscale cgroup handling logic into the CgroupWatcher itself via message passing of commands, rather than acting on the cgroup directly from the thread handling the websocket message.

This change is the large prerequisite for a handful of smaller changes that should be much easier to make afterwards, all part of the Epic about fixing memory.high throttling (#5444):

  1. Fix a potential race condition wherein the logic that increases memory.high in response to memory.high events could overwrite newer (more permissive) values set by actual upscaling
    • Handled by this change already!
  2. Fix a bug where, because memory.high has already been increased to avoid throttling, upscaling actually decreases memory.high and leads to unrecoverable throttling.
  3. If memory.high has been increased to avoid throttling but no upscaling has happened, periodically try to decrease it back to the desired memory.high.

For more general context, refer to #5444.
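
To make the intended shape concrete, here is a minimal sketch of that command-passing design, assuming tokio channels. The names (`CgroupCommand`, `CgroupWatcher::run`, `handle_upscale`, field names) are illustrative, not the actual vm-monitor API. The point is that a single task owns the cgroup, so memory.high writes triggered by upscaling and by memory.high events are serialized instead of racing.

```rust
// Sketch only: illustrative names, not the real vm-monitor types.
use tokio::sync::{mpsc, oneshot};

/// Commands sent from the websocket-handling task to the cgroup watcher.
enum CgroupCommand {
    /// Raise memory.high to match a newly granted memory size.
    Upscale { mem_high_bytes: u64, done: oneshot::Sender<()> },
    /// Lower memory.high; reply with whether the downscale was accepted.
    Downscale { mem_high_bytes: u64, done: oneshot::Sender<bool> },
}

struct CgroupWatcher {
    commands: mpsc::Receiver<CgroupCommand>,
    current_mem_high: u64,
}

impl CgroupWatcher {
    /// Single task that owns the cgroup. Because scaling commands and
    /// memory.high-event handling both run in this loop, updates to
    /// memory.high cannot overwrite each other out of order.
    async fn run(mut self) {
        while let Some(cmd) = self.commands.recv().await {
            match cmd {
                CgroupCommand::Upscale { mem_high_bytes, done } => {
                    self.current_mem_high = mem_high_bytes;
                    // ... write memory.high to the cgroup here ...
                    let _ = done.send(());
                }
                CgroupCommand::Downscale { mem_high_bytes, done } => {
                    // ... check current usage and refuse if it's too high ...
                    self.current_mem_high = mem_high_bytes;
                    let _ = done.send(true);
                }
            }
        }
    }
}

// On the websocket side, a handler just sends a command and awaits the reply:
async fn handle_upscale(tx: &mpsc::Sender<CgroupCommand>, mem_high_bytes: u64) {
    let (done, wait) = oneshot::channel();
    let _ = tx.send(CgroupCommand::Upscale { mem_high_bytes, done }).await;
    let _ = wait.await;
}
```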


Remaining items before merging:

  • Self-review
  • Add (basic, at the very least) doc comments to the functions added in this change
  • Run some manual tests using the autoscaling repo's setup

github-actions bot commented Oct 6, 2023

2250 tests run: 2134 passed, 0 failed, 116 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_tenant_detach_smoke: debug

Code coverage (full report)

  • functions: 52.4% (8120 of 15485 functions)
  • lines: 80.9% (47482 of 58719 lines)

The comment gets automatically updated with the latest test results.
373eb7f at 2023-10-06T06:05:19.560Z ♻️

sharnoff added a commit that referenced this pull request Oct 17, 2023
tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to instead periodically fetch memory
statistics from the cgroup's memory.stat and use that data to determine
if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
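
As a rough illustration of that polling approach (a sketch under assumptions, not the code from the superseding PR): read the cgroup's memory.stat periodically and request upscaling when usage gets close to the current memory size. The cgroup name, the `anon` field, the 90% threshold, and the polling interval below are all placeholders.

```rust
// Sketch of periodically polling cgroup v2 memory.stat to decide on upscaling.
use std::{fs, time::Duration};

fn read_memory_stat(cgroup: &str, key: &str) -> Option<u64> {
    // cgroup v2 exposes memory statistics as "<key> <value>" lines.
    let path = format!("/sys/fs/cgroup/{cgroup}/memory.stat");
    let contents = fs::read_to_string(path).ok()?;
    for line in contents.lines() {
        let mut parts = line.split_whitespace();
        if parts.next() == Some(key) {
            return parts.next()?.parse().ok();
        }
    }
    None
}

fn main() {
    let cgroup = "neon-postgres"; // assumed cgroup name, for illustration
    let limit_bytes: u64 = 4 << 30; // assumed current memory size
    loop {
        // "anon" approximates working-set usage without the tmpfs/page-cache
        // accounting that made memory.high-based signals hard to interpret.
        if let Some(anon) = read_memory_stat(cgroup, "anon") {
            if anon > limit_bytes / 10 * 9 {
                println!("usage {anon} near limit {limit_bytes}; request upscale");
                // ... send an upscale request to the autoscaler-agent here ...
            }
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```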
sharnoff added a commit that referenced this pull request Oct 20, 2023
koivunej removed their request for review April 3, 2024 14:31
koivunej (Member) commented Apr 3, 2024

I think I didn't have time to review this PR before some vacation of mine, and it has since been superseded by something else; this approach was abandoned.

sharnoff (Member, Author) commented Apr 3, 2024

Yeah, this was abandoned. vm-monitor could still use some heavy refactoring, but that should be a separate PR anyway.

sharnoff closed this Apr 3, 2024
sharnoff deleted the sharnoff/vm-monitor-race-free-cgroup branch April 3, 2024 15:06