
vm-monitor: Switch from memory.high to polling memory.stat #5524

Merged
merged 11 commits into main from sharnoff/vm-monitor-memory-polling
Oct 17, 2023

Conversation

sharnoff (Member) commented Oct 11, 2023

tl;dr it's really hard to avoid throttling from memory.high, and it counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be good enough, this PR instead switches to periodically fetching memory statistics from the cgroup's memory.stat and using that data to determine if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties we've hit with memory.high. This PR also supersedes #5488.


Opened this quickly to (hopefully) save on time spent waiting on review. Still need to (unfortunately) do some manual tests to confirm that this actually works as intended. Also still need to self-review the code.

Currently this PR polls every 100ms and uses a rolling average of the last 500ms to make decisions, periodically logging the last 2s of values collected. This could be improved, and I'm open to suggestions.
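
For a concrete picture of the mechanism, here is a minimal, synchronous sketch (not the actual vm-monitor code; the `neon-postgres` cgroup name comes from the existing docs, while the 800 MiB threshold and the choice of the `anon` counter as the usage signal are assumptions for illustration only):

```rust
use std::collections::VecDeque;
use std::io;
use std::time::Duration;

/// Fixed-size history of recent samples: 5 samples at a 100ms poll interval
/// gives the 500ms rolling-average window described above.
struct MemHistory {
    samples: VecDeque<u64>,
    capacity: usize,
}

impl MemHistory {
    fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    fn push(&mut self, value: u64) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(value);
    }

    fn rolling_avg(&self) -> Option<u64> {
        if self.samples.is_empty() {
            None
        } else {
            Some(self.samples.iter().sum::<u64>() / self.samples.len() as u64)
        }
    }
}

/// Read one counter (e.g. `anon`) out of the cgroup v2 memory.stat file.
fn read_stat(cgroup: &str, key: &str) -> io::Result<u64> {
    let path = format!("/sys/fs/cgroup/{cgroup}/memory.stat");
    let contents = std::fs::read_to_string(&path)?;
    for line in contents.lines() {
        // memory.stat lines look like "anon 123456"; the space check skips e.g. "anon_thp".
        if let Some(value) = line.strip_prefix(key).and_then(|rest| rest.strip_prefix(' ')) {
            return value
                .parse::<u64>()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e));
        }
    }
    Err(io::Error::new(
        io::ErrorKind::InvalidData,
        format!("no `{key}` entry in {path}"),
    ))
}

fn main() -> io::Result<()> {
    let threshold: u64 = 800 * 1024 * 1024; // hypothetical upscale threshold (800 MiB)
    let mut history = MemHistory::new(5);

    loop {
        history.push(read_stat("neon-postgres", "anon")?);
        if let Some(avg) = history.rolling_avg() {
            if avg >= threshold {
                println!("rolling avg {avg} B over threshold; would request upscaling");
            }
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```

(The real implementation runs inside the vm-monitor's async runtime and requests upscaling instead of printing.)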

@sharnoff sharnoff requested review from problame and koivunej October 11, 2023 06:55
@sharnoff sharnoff marked this pull request as ready for review October 11, 2023 07:03
@sharnoff sharnoff requested review from a team as code owners October 11, 2023 07:03
@sharnoff sharnoff requested review from bojanserafimov and removed request for a team October 11, 2023 07:03
github-actions bot commented Oct 11, 2023

2304 tests run: 2194 passed, 0 failed, 110 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_crafted_wal_end[wal_record_crossing_segment_followed_by_small_one]: debug

Code coverage (full report)

  • functions: 52.9% (8286 of 15672 functions)
  • lines: 80.6% (48323 of 59987 lines)

The comment gets automatically updated with the latest test results
51128bc at 2023-10-17T21:03:18.281Z :recycle:

sharnoff (Member, Author) commented Oct 16, 2023

Tested this on staging by manually deleting & recreating a VM with the image built by CI, using this script that IIRC I've shared elsewhere:

#!/usr/bin/env bash

set -eu -o pipefail

if [ "$#" -ne 2 ]; then
    echo "USAGE: $0 <VirtualMachine name> <image>"
    exit 1
fi

vm="$1"
image="$2"

vm_json="$(kubectl get neonvm "$vm" -o json | jq --arg img "$image" '.spec.guest.rootDisk.image = $img')"

# delete the VM and recreate it
kubectl delete neonvm "$vm"
echo "$vm_json" | kubectl apply -f -

Results from testing: I was unable to get "normal" pgbench to trigger upscaling — and memory seemed to stay at normal amounts. When I ran the synthetic "high CPU usage" pgbench query, upscaling and downscaling performed as normal.

Logs were far too verbose. It's currently logging recent memory stats about once every 0.5s. It should be more like 2-5s, at least.

At this point, I still need to test locally with the autoscaling repo's tooling for synthetic memory usage, but presuming that passes, we should be good.

problame (Contributor)

At this point, I still need to test locally with the autoscaling repo's tooling for synthetic memory usage, but presuming that passes, we should be good.

I assume these tests happened?

problame (Contributor)

This needs updating:

To control thresholds for receiving memory usage notifications, we start Postgres
in the `neon-postgres` cgroup and set its `memory.{max,high}`.


// New memory stats from the cgroup, *may* need to request upscaling, if we've
// exceeded the threshold
result = self.cgroup.as_mut().unwrap().watcher.changed(), if self.cgroup.is_some() => {
Contributor

The docs say

Additionally, each branch may include an optional if precondition. If the precondition returns false, then the branch is disabled. The provided `<async expression>` is still evaluated but the resulting future is never polled. This capability is useful when using select! within a loop.

https://docs.rs/tokio/latest/tokio/macro.select.html

So, I guess the .unwrap() here would panic?

(I haven't used this feature of select! before.)
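
For anyone else unfamiliar with it, here is a minimal standalone reproduction of the pitfall (a toy `watch` channel standing in for the cgroup watcher, not the vm-monitor code; assumes tokio with the `macros`, `rt-multi-thread`, `sync`, and `time` features):

```rust
use std::time::Duration;
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // Stand-in for `self.cgroup`: absent in this run.
    let mut cgroup: Option<watch::Receiver<u64>> = None;

    tokio::select! {
        _ = tokio::time::sleep(Duration::from_millis(100)) => {
            println!("timer branch won");
        }
        // The `if` precondition only stops the branch from being *polled*;
        // the expression before `, if ...` is still evaluated up front, so
        // this `.unwrap()` panics whenever `cgroup` is `None`.
        _ = cgroup.as_mut().unwrap().changed(), if cgroup.is_some() => {
            println!("cgroup watcher branch won");
        }
    }
}
```

One way to keep the precondition without the panic would be to wrap the expression in an async block (`async { cgroup.as_mut().unwrap().changed().await }`): the unwrap then only runs when the branch is polled, and a disabled branch is never polled.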

Contributor

Generally, is it worth supporting the optionality of running without cgroups?

sharnoff (Member, Author) commented Oct 17, 2023

Oh yikes, I hadn't realized that before. And yeah, it generally isn't worth making the cgroups stuff optional, just a holdover from earlier designs.

Unfortunately, this bad tokio::select! is already the current behavior, which was implemented in #5441, so I'm inclined to ignore this for now and address it in a follow-up.
(Separately, there's significant refactoring that I'd like to do here [particularly in pursuit of neondatabase/cloud#6701], but haven't had the chance to get to that yet)

problame (Contributor)

Looked at the upscale path and left a few comments.

Is there any reason why this still supports the case where file cache sits on tmpfs? I thought we were past that 😮

sharnoff (Member, Author)

I assume these tests happened?

Only one of them had. The second one (manual testing with artificial memory usage) has now been done, and this PR works as expected. Testing process below:

Testing process

With a checked out autoscaling repo, change the Makefile's VM_MONITOR_BRANCH from main to sharnoff/vm-monitor-memory-polling.

Set up the cluster and prep the test VM's image:

make kind-setup && make deploy && make pg14-disk-test

Deploy the test VM:

kubectl apply -f vm-deploy.yaml

Open multiple windows to stream logs and current VM state. In one, run:

kubectl get -w neonvm

and in another, run:

kubectl logs -f -n kube-system -l name=autoscaler-agent | jq --unbuffered -r '"\(.ts | strftime("%Y-%m-%d %T")) \(.level) \(.logger): \(.msg) err=\(.error) stack=\(.stack)"'

Log into the VM with kubectl exec -it <pod name> -- screen /dev/pts/0, and run:

cgexec -g memory:neon-test allocate-loop 800 850

The VM should immediately upscale, and then ^C to kill allocate-loop should cause it to downscale shortly after (within a few seconds; but not as quickly as the upscaling).

Is there any reason why this still supports the case where file cache sits on tmpfs? I thought we were past that 😮

Yeah; rollout was gradual, via a per-region config flag, so even though the flag has been enabled for every region now (since last Wednesday), the old code has not yet been removed (partly because I wasn't sure that we wouldn't find issues and subsequently need to revert away from on-disk).

General plan in that regard is that removing support for the tmpfs-backed file cache here will come after removing the config flag in the control plane, just out of an abundance of caution in case there are any misconfigurations.

problame (Contributor) left a comment

Took a look at downscale; came away confused.

Comment on lines 299 to 307
     let message = format!(
-        "set cgroup memory.high to {} MiB, of new max {} MiB",
-        bytes_to_mebibytes(new_cgroup_mem_high),
-        bytes_to_mebibytes(available_memory)
+        "set cgroup memory threshold from {} MiB to {} MiB, of new total {} MiB",
+        bytes_to_mebibytes(cgroup.threshold),
+        bytes_to_mebibytes(new_threshold),
+        bytes_to_mebibytes(usable_system_memory)
     );
+    cgroup.threshold = new_threshold;
     info!("downscale: {message}");
     status.push(message);
Contributor

Where do we actually reconfigure the cgroup? Do we even still do that?

Member Author

After this PR, the only interaction we have with the cgroup is to fetch memory statistics. I guess saying "set cgroup memory threshold" is a bit misleading in that sense.
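
Roughly, the shape after this PR (illustrative names only, not the actual vm-monitor types): the threshold is plain monitor-side state that gets compared against the polled memory.stat values; nothing is written back to the cgroup filesystem.

```rust
/// Illustrative sketch only: not the actual vm-monitor types.
struct CgroupState {
    /// Bytes of usage above which we ask for more memory. Recomputed on
    /// every upscale/downscale; never written to the cgroup itself.
    threshold: u64,
}

impl CgroupState {
    /// Recompute the threshold for a new memory size (hypothetical policy:
    /// keep a fixed headroom below the usable system memory).
    fn on_resize(&mut self, usable_system_memory: u64, headroom: u64) {
        self.threshold = usable_system_memory.saturating_sub(headroom);
    }

    /// Compare the rolling average of polled memory.stat values against the
    /// threshold; reading stats is the only remaining cgroup interaction.
    fn wants_upscale(&self, rolling_avg_usage: u64) -> bool {
        rolling_avg_usage >= self.threshold
    }
}

fn main() {
    let mut cg = CgroupState { threshold: 0 };
    cg.on_resize(4 * 1024 * 1024 * 1024, 512 * 1024 * 1024); // 4 GiB total, 512 MiB headroom
    assert!(cg.wants_upscale(3_800_000_000));
    assert!(!cg.wants_upscale(1_000_000_000));
}
```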

problame (Contributor) left a comment

Synced offline, good to go.

@sharnoff sharnoff merged commit 9fe5cc6 into main Oct 17, 2023
34 checks passed
@sharnoff sharnoff deleted the sharnoff/vm-monitor-memory-polling branch October 17, 2023 22:30
sharnoff added a commit that referenced this pull request Oct 20, 2023
Successfully merging this pull request may close these issues.

Epic: Fix vm-monitor cgroup memory.high throttling