
vm-monitor: Refactor scaling logic into CgroupWatcher #5488

Closed
sharnoff wants to merge 1 commit

Conversation

sharnoff (Member) commented Oct 6, 2023

The general idea of this PR is to move the on-downscale and on-upscale cgroup handling logic into the CgroupWatcher itself via message passing of commands, rather than acting on the cgroup directly from the thread handling the websocket message.

This change is the large prerequisite for a handful of smaller changes that should be much easier to make afterwards, all part of the Epic about fixing memory.high throttling (#5444):

  1. Fix a potential race condition wherein the logic that increases memory.high in response to memory.high events could overwrite newer (more permissive) values set by actual upscaling
    • Handled by this change already!
  2. Fix a bug where, because memory.high has already been increased to avoid throttling, upscaling actually decreases memory.high and leads to unrecoverable throttling.
  3. If memory.high has been increased to avoid throttling but no upscaling has happened, periodically try to decrease it back to the desired memory.high.

For more general context, refer to #5444.
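
To make the intended shape concrete, here is a minimal sketch of that command-passing design, assuming tokio channels. The names (`CgroupCommand`, `CgroupWatcher::run`, `handle_upscale`, field names) are illustrative, not the actual vm-monitor API. The point is that a single task owns the cgroup, so memory.high writes triggered by upscaling and by memory.high events are serialized instead of racing.

```rust
// Sketch only: illustrative names, not the real vm-monitor types.
use tokio::sync::{mpsc, oneshot};

/// Commands sent from the websocket-handling task to the cgroup watcher.
enum CgroupCommand {
    /// Raise memory.high to match a newly granted memory size.
    Upscale { mem_high_bytes: u64, done: oneshot::Sender<()> },
    /// Lower memory.high; reply with whether the downscale was accepted.
    Downscale { mem_high_bytes: u64, done: oneshot::Sender<bool> },
}

struct CgroupWatcher {
    commands: mpsc::Receiver<CgroupCommand>,
    current_mem_high: u64,
}

impl CgroupWatcher {
    /// Single task that owns the cgroup. Because scaling commands and
    /// memory.high-event handling both run in this loop, updates to
    /// memory.high cannot overwrite each other out of order.
    async fn run(mut self) {
        while let Some(cmd) = self.commands.recv().await {
            match cmd {
                CgroupCommand::Upscale { mem_high_bytes, done } => {
                    self.current_mem_high = mem_high_bytes;
                    // ... write memory.high to the cgroup here ...
                    let _ = done.send(());
                }
                CgroupCommand::Downscale { mem_high_bytes, done } => {
                    // ... check current usage and refuse if it's too high ...
                    self.current_mem_high = mem_high_bytes;
                    let _ = done.send(true);
                }
            }
        }
    }
}

// On the websocket side, a handler just sends a command and awaits the reply:
async fn handle_upscale(tx: &mpsc::Sender<CgroupCommand>, mem_high_bytes: u64) {
    let (done, wait) = oneshot::channel();
    let _ = tx.send(CgroupCommand::Upscale { mem_high_bytes, done }).await;
    let _ = wait.await;
}
```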


Remaining items before merging:

  • Self-review
  • Add (basic, at the very least) doc comments to the functions added in this change
  • Run some manual tests using the autoscaling repo's setup

github-actions bot commented Oct 6, 2023

2250 tests run: 2134 passed, 0 failed, 116 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_tenant_detach_smoke: debug

Code coverage (full report)

  • functions: 52.4% (8120 of 15485 functions)
  • lines: 80.9% (47482 of 58719 lines)

The comment gets automatically updated with the latest test results.
373eb7f at 2023-10-06T06:05:19.560Z ♻️

sharnoff added a commit that referenced this pull request Oct 17, 2023
tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to instead periodically fetch memory
statistics from the cgroup's memory.stat and use that data to determine
if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
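
As a rough illustration of that polling approach (a sketch under assumptions, not the code from the superseding PR): read the cgroup's memory.stat periodically and request upscaling when usage gets close to the current memory size. The cgroup name, the `anon` field, the 90% threshold, and the polling interval below are all placeholders.

```rust
// Sketch of periodically polling cgroup v2 memory.stat to decide on upscaling.
use std::{fs, time::Duration};

fn read_memory_stat(cgroup: &str, key: &str) -> Option<u64> {
    // cgroup v2 exposes memory statistics as "<key> <value>" lines.
    let path = format!("/sys/fs/cgroup/{cgroup}/memory.stat");
    let contents = fs::read_to_string(path).ok()?;
    for line in contents.lines() {
        let mut parts = line.split_whitespace();
        if parts.next() == Some(key) {
            return parts.next()?.parse().ok();
        }
    }
    None
}

fn main() {
    let cgroup = "neon-postgres"; // assumed cgroup name, for illustration
    let limit_bytes: u64 = 4 << 30; // assumed current memory size
    loop {
        // "anon" approximates working-set usage without the tmpfs/page-cache
        // accounting that made memory.high-based signals hard to interpret.
        if let Some(anon) = read_memory_stat(cgroup, "anon") {
            if anon > limit_bytes / 10 * 9 {
                println!("usage {anon} near limit {limit_bytes}; request upscale");
                // ... send an upscale request to the autoscaler-agent here ...
            }
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```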
sharnoff added a commit that referenced this pull request Oct 20, 2023
koivunej removed their request for review April 3, 2024 14:31
koivunej (Member) commented Apr 3, 2024

I think I didn't have time to review this PR before some vacation of mine, and it has since been superseded by something else; this approach was abandoned.

sharnoff (Member, Author) commented Apr 3, 2024

Yeah, this was abandoned. vm-monitor could still use some heavy refactoring, but that should be a separate PR anyway.

sharnoff closed this Apr 3, 2024
sharnoff deleted the sharnoff/vm-monitor-race-free-cgroup branch April 3, 2024 15:06