vm-monitor: Switch from memory.high to polling memory.stat #5524
Conversation
tl;dr it's really hard to avoid throttling from memory.high, and it counts tmpfs & page cache usage, so it's also hard to make sense of. In the interest of fixing things quickly with something that should be *good enough*, this PR instead switches to periodically fetching memory statistics from the cgroup's memory.stat, using that data to determine if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties we've hit with memory.high. It also supersedes #5488.

Opened this quickly to (hopefully) save on time spent waiting on review. Still need to (unfortunately) do some manual tests that this actually works as intended. Also still need to self-review the code.

Currently this PR polls every 100ms and uses a rolling average of the last 500ms to make decisions, periodically logging the last 2s of values collected. This could be improved, and I'm open to suggestions.
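For concreteness, a minimal sketch of the polling approach described above. This is not the PR's actual code: the cgroup path, the choice of the `anon` counter (which, unlike memory.high accounting, excludes tmpfs and page cache), and the threshold value are all assumptions for illustration.

```rust
use std::{fs, time::Duration};

// Read resident anonymous memory from the cgroup's memory.stat (cgroup v2
// layout assumed; the `anon` counter ignores tmpfs and page cache).
fn current_memory_bytes(cgroup: &str) -> Option<u64> {
    let stat = fs::read_to_string(format!("/sys/fs/cgroup/{cgroup}/memory.stat")).ok()?;
    stat.lines()
        .find_map(|l| l.strip_prefix("anon "))
        .and_then(|v| v.trim().parse().ok())
}

fn main() {
    let mut window: Vec<u64> = Vec::new();
    let threshold_bytes: u64 = 800 * 1024 * 1024; // hypothetical upscale threshold

    loop {
        if let Some(bytes) = current_memory_bytes("neon-test") {
            window.push(bytes);
            if window.len() > 5 {
                window.remove(0); // keep the last 5 samples (~500ms at 100ms polls)
            }
            let avg = window.iter().sum::<u64>() / window.len() as u64;
            if avg > threshold_bytes {
                println!("rolling average {avg} B over threshold; would request upscale");
            }
        }
        std::thread::sleep(Duration::from_millis(100)); // poll every 100ms
    }
}
```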
2304 tests run: 2194 passed, 0 failed, 110 skipped (full report). Flaky tests (1): Postgres 16.
Code coverage (full report)
The comment gets automatically updated with the latest test results
51128bc at 2023-10-17T21:03:18.281Z :recycle:
Tested this on staging by manually deleting & recreating a VM with the image built by CI, using this script that IIRC I've shared elsewhere:

```sh
#!/usr/bin/env bash

set -eu -o pipefail

if [ "$#" -ne 2 ]; then
    echo "USAGE: $0 <VirtualMachine name> <image>"
    exit 1
fi

vm="$1"
image="$2"

vm_json="$(kubectl get neonvm "$vm" -o json | jq --arg img "$image" '.spec.guest.rootDisk.image = $img')"

# delete the VM and recreate it
kubectl delete neonvm "$vm"
echo "$vm_json" | kubectl apply -f -
```

Results from testing:

- I was unable to get "normal" pgbench to trigger upscaling, and memory seemed to stay at normal amounts. When I ran the synthetic "high CPU usage" pgbench query, upscaling and downscaling performed as normal.
- Logs were far too verbose. It's currently logging recent memory stats about once every 0.5s. It should be more like 2-5s, at least.

At this point, I still need to test locally with the autoscaling repo's tooling for synthetic memory usage, but presuming that passes, we should be good.
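On the log-verbosity point, a hedged sketch of one way to decouple the poll cadence from the log cadence; the intervals and structure are illustrative, not the PR's code:

```rust
use std::time::{Duration, Instant};

fn main() {
    let poll_every = Duration::from_millis(100);
    let log_every = Duration::from_secs(3); // within the suggested 2-5s range
    let mut last_log = Instant::now();

    loop {
        // ... poll memory.stat and update the rolling window here ...

        // Only emit the stats log line every few seconds, regardless of
        // how often we poll.
        if last_log.elapsed() >= log_every {
            println!("recent memory stats: ...");
            last_log = Instant::now();
        }
        std::thread::sleep(poll_every);
    }
}
```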
I assume these tests happened?
This needs updating: neon/libs/vm_monitor/README.md, lines 12 to 13 in 85d6d9d.
```rust
// New memory stats from the cgroup, *may* need to request upscaling, if we've
// exceeded the threshold
result = self.cgroup.as_mut().unwrap().watcher.changed(), if self.cgroup.is_some() => {
```
The docs say:

> Additionally, each branch may include an optional if precondition. If the precondition returns false, then the branch is disabled. The provided `<async expression>` is still evaluated but the resulting future is never polled. This capability is useful when using select! within a loop.

https://docs.rs/tokio/latest/tokio/macro.select.html

So, I guess the `.unwrap()` here would panic?

(I haven't used this feature of select! before.)
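A minimal, self-contained repro of the concern (hypothetical code, not from the PR; assumes tokio with the `full` feature set):

```rust
use std::time::Duration;
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // No watcher present, mirroring `self.cgroup == None`.
    let mut maybe_rx: Option<watch::Receiver<u32>> = None;

    tokio::select! {
        // The branch expression is evaluated even when the `if` guard
        // disables the branch, so this `.unwrap()` panics on `None` even
        // though the resulting future would never be polled.
        _ = maybe_rx.as_mut().unwrap().changed(), if maybe_rx.is_some() => {}
        _ = tokio::time::sleep(Duration::from_millis(10)) => {}
    }
}
```

Running this panics at the `unwrap()` rather than falling through to the sleep branch, matching the concern above.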
Generally, is it worth it to support the optionality of running without cgroups?
Oh yikes, I hadn't realized that before. And yeah, it generally isn't worth making the cgroups stuff optional; that's just a holdover from earlier designs.

Unfortunately, this bad tokio::select! is already the current behavior, which was implemented in #5441, so I'm inclined to ignore this for now and address it in a follow-up.

(Separately, there's significant refactoring that I'd like to do here [particularly in pursuit of neondatabase/cloud#6701], but I haven't had the chance to get to that yet.)
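One possible shape for such a follow-up, as a hypothetical sketch (not code from this PR): move the optional access into a helper future, so nothing is unwrapped while the select! branches are being constructed.

```rust
use tokio::sync::watch;

// If a watcher exists, wait for its next change notification; otherwise
// return a future that never resolves, so the select! branch never fires.
async fn changed_or_pending(rx: Option<&mut watch::Receiver<u32>>) {
    match rx {
        Some(rx) => {
            let _ = rx.changed().await;
        }
        None => std::future::pending::<()>().await,
    }
}
```

The select! arm could then read `result = changed_or_pending(self.cgroup.as_mut().map(|c| &mut c.watcher)) => { ... }`, with no `unwrap()` and no `if` guard.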
Looked at the upscale path and left a few comments. Is there any reason why this still supports the case where file cache sits on tmpfs? I thought we were past that 😮
Co-authored-by: Christian Schwarz <[email protected]>
Only one of them had. The second one (manually testing via artificial memory usage) has now been done, and this PR works as expected. Testing process below:

**Testing process**

With a checked out autoscaling repo, change the Makefile's

Set up the cluster and prep the test VM's image:

```sh
make kind-setup && make deploy && make pg14-disk-test
```

Deploy the test VM:

```sh
kubectl apply -f vm-deploy.yaml
```

Open multiple windows to stream logs and current VM state. In one, run:

```sh
kubectl get -w neonvm
```

and in another, run:

```sh
kubectl logs -f -n kube-system -l name=autoscaler-agent | jq --unbuffered -r '"\(.ts | strftime("%Y-%m-%d %T")) \(.level) \(.logger): \(.msg) err=\(.error) stack=\(.stack)"'
```

Log into the VM and run:

```sh
cgexec -g memory:neon-test allocate-loop 800 850
```

The VM should immediately upscale, and then
Yeah; rollout was gradual, via a per-region config flag, so even though the flag has been enabled for every region now (since last Wednesday), the old code has not yet been removed (partly because I wasn't sure that we wouldn't find issues and subsequently need to revert away from on-disk). The general plan in that regard is that removing support for tmpfs-backed file cache here will come after removing the config flag in the control plane, just out of an abundance of caution in case there are any misconfigurations.
Took a look at downscale; came away confused.
```diff
 let message = format!(
-    "set cgroup memory.high to {} MiB, of new max {} MiB",
-    bytes_to_mebibytes(new_cgroup_mem_high),
-    bytes_to_mebibytes(available_memory)
+    "set cgroup memory threshold from {} MiB to {} MiB, of new total {} MiB",
+    bytes_to_mebibytes(cgroup.threshold),
+    bytes_to_mebibytes(new_threshold),
+    bytes_to_mebibytes(usable_system_memory)
 );
+cgroup.threshold = new_threshold;
 info!("downscale: {message}");
 status.push(message);
```
Where do we actually reconfigure the cgroup? Do we even still do that?
After this PR, the only interaction we have with the cgroup is to fetch memory statistics. I guess saying "set cgroup memory threshold" is a bit misleading in that sense.
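For illustration, a hypothetical sketch of the distinction being discussed: after this PR, the threshold is vm-monitor's own bookkeeping, compared against polled memory.stat values, rather than a limit written into the cgroup.

```rust
// Hypothetical struct, not the PR's actual types.
struct CgroupState {
    threshold: u64, // bytes; upscale when polled usage crosses this
}

impl CgroupState {
    fn set_threshold(&mut self, new_threshold: u64) {
        // Note: no write to /sys/fs/cgroup/<...>/memory.high here;
        // only vm-monitor's internal state changes.
        self.threshold = new_threshold;
    }

    fn should_upscale(&self, avg_usage: u64) -> bool {
        avg_usage >= self.threshold
    }
}
```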
Synced offline, good to go.