vm-monitor: Switch from memory.high to polling memory.stat #5524
Conversation
tl;dr it's really hard to avoid throttling from memory.high, and it counts tmpfs & page cache usage, so it's also hard to make sense of. In the interest of fixing things quickly with something that should be *good enough*, this PR instead switches to periodically fetching memory statistics from the cgroup's memory.stat, using that data to determine if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties we've hit with memory.high. It also supersedes #5488.

Opened this quickly to (hopefully) save on time spent waiting on review. Still need to (unfortunately) do some manual tests that this actually works as intended. Also still need to self-review the code.

Currently this PR polls every 100ms and uses a rolling average of the last 500ms to make decisions, periodically logging the last 2s of values collected. This could be improved, and I'm open to suggestions.
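For concreteness, a minimal sketch of the polling approach described above. This is not the PR's actual code: the cgroup path, the choice of the `anon` counter (which, unlike memory.high accounting, excludes tmpfs and page cache), and the threshold value are all assumptions for illustration.

```rust
use std::{fs, time::Duration};

// Read resident anonymous memory from the cgroup's memory.stat (cgroup v2
// layout assumed; the `anon` counter ignores tmpfs and page cache).
fn current_memory_bytes(cgroup: &str) -> Option<u64> {
    let stat = fs::read_to_string(format!("/sys/fs/cgroup/{cgroup}/memory.stat")).ok()?;
    stat.lines()
        .find_map(|l| l.strip_prefix("anon "))
        .and_then(|v| v.trim().parse().ok())
}

fn main() {
    let mut window: Vec<u64> = Vec::new();
    let threshold_bytes: u64 = 800 * 1024 * 1024; // hypothetical upscale threshold

    loop {
        if let Some(bytes) = current_memory_bytes("neon-test") {
            window.push(bytes);
            if window.len() > 5 {
                window.remove(0); // keep the last 5 samples (~500ms at 100ms polls)
            }
            let avg = window.iter().sum::<u64>() / window.len() as u64;
            if avg > threshold_bytes {
                println!("rolling average {avg} B over threshold; would request upscale");
            }
        }
        std::thread::sleep(Duration::from_millis(100)); // poll every 100ms
    }
}
```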
2304 tests run: 2194 passed, 0 failed, 110 skipped (full report). Flaky tests (1): Postgres 16.
Code coverage (full report)
The comment gets automatically updated with the latest test results
51128bc at 2023-10-17T21:03:18.281Z :recycle:
Tested this on staging by manually deleting & recreating a VM with the image built by CI, using this script that IIRC I've shared elsewhere:

```sh
#!/usr/bin/env bash

set -eu -o pipefail

if [ "$#" -ne 2 ]; then
    echo "USAGE: $0 <VirtualMachine name> <image>"
    exit 1
fi

vm="$1"
image="$2"

vm_json="$(kubectl get neonvm "$vm" -o json | jq --arg img "$image" '.spec.guest.rootDisk.image = $img')"

# delete the VM and recreate it
kubectl delete neonvm "$vm"
echo "$vm_json" | kubectl apply -f -
```

Results from testing:

- I was unable to get "normal" pgbench to trigger upscaling, and memory seemed to stay at normal amounts. When I ran the synthetic "high CPU usage" pgbench query, upscaling and downscaling performed as normal.
- Logs were far too verbose. It's currently logging recent memory stats about once every 0.5s. It should be more like 2-5s, at least.

At this point, I still need to test locally with the autoscaling repo's tooling for synthetic memory usage, but presuming that passes, we should be good.
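On the log-verbosity point, a hedged sketch of one way to decouple the poll cadence from the log cadence; the intervals and structure are illustrative, not the PR's code:

```rust
use std::time::{Duration, Instant};

fn main() {
    let poll_every = Duration::from_millis(100);
    let log_every = Duration::from_secs(3); // within the suggested 2-5s range
    let mut last_log = Instant::now();

    loop {
        // ... poll memory.stat and update the rolling window here ...

        // Only emit the stats log line every few seconds, regardless of
        // how often we poll.
        if last_log.elapsed() >= log_every {
            println!("recent memory stats: ...");
            last_log = Instant::now();
        }
        std::thread::sleep(poll_every);
    }
}
```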
I assume these tests happened?
This needs updating: neon/libs/vm_monitor/README.md, lines 12 to 13 in 85d6d9d.
```rust
// New memory stats from the cgroup, *may* need to request upscaling, if we've
// exceeded the threshold
result = self.cgroup.as_mut().unwrap().watcher.changed(), if self.cgroup.is_some() => {
```
The docs say:

> Additionally, each branch may include an optional if precondition. If the precondition returns false, then the branch is disabled. The provided `<async expression>` is still evaluated but the resulting future is never polled. This capability is useful when using select! within a loop.

https://docs.rs/tokio/latest/tokio/macro.select.html

So, I guess the `.unwrap()` here would panic?

(I haven't used this feature of select! before.)
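A minimal, self-contained repro of the concern (hypothetical code, not from the PR; assumes tokio with the `full` feature set):

```rust
use std::time::Duration;
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // No watcher present, mirroring `self.cgroup == None`.
    let mut maybe_rx: Option<watch::Receiver<u32>> = None;

    tokio::select! {
        // The branch expression is evaluated even when the `if` guard
        // disables the branch, so this `.unwrap()` panics on `None` even
        // though the resulting future would never be polled.
        _ = maybe_rx.as_mut().unwrap().changed(), if maybe_rx.is_some() => {}
        _ = tokio::time::sleep(Duration::from_millis(10)) => {}
    }
}
```

Running this panics at the `unwrap()` rather than falling through to the sleep branch, matching the concern above.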
Generally, is it worth it to support the optionality of running without cgroups?
Oh yikes, I hadn't realized that before. And yeah, it generally isn't worth making the cgroups stuff optional; that's just a holdover from earlier designs.

Unfortunately, this bad tokio::select! is already the current behavior, which was implemented in #5441, so I'm inclined to ignore this for now and address it in a follow-up.

(Separately, there's significant refactoring that I'd like to do here [particularly in pursuit of neondatabase/cloud#6701], but I haven't had the chance to get to that yet.)
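One possible shape for such a follow-up, as a hypothetical sketch (not code from this PR): move the optional access into a helper future, so nothing is unwrapped while the select! branches are being constructed.

```rust
use tokio::sync::watch;

// If a watcher exists, wait for its next change notification; otherwise
// return a future that never resolves, so the select! branch never fires.
async fn changed_or_pending(rx: Option<&mut watch::Receiver<u32>>) {
    match rx {
        Some(rx) => {
            let _ = rx.changed().await;
        }
        None => std::future::pending::<()>().await,
    }
}
```

The select! arm could then read `result = changed_or_pending(self.cgroup.as_mut().map(|c| &mut c.watcher)) => { ... }`, with no `unwrap()` and no `if` guard.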
Looked at the upscale path and left a few comments. Is there any reason why this still supports the case where file cache sits on tmpfs? I thought we were past that 😮
Co-authored-by: Christian Schwarz <[email protected]>
Only one of them had. The second one (manually testing via artificial memory usage) has now been done, and this PR works as expected. Testing process below:

**Testing process**

With a checked out autoscaling repo, change the Makefile's

Set up the cluster and prep the test VM's image:

```sh
make kind-setup && make deploy && make pg14-disk-test
```

Deploy the test VM:

```sh
kubectl apply -f vm-deploy.yaml
```

Open multiple windows to stream logs and current VM state. In one, run:

```sh
kubectl get -w neonvm
```

and in another, run:

```sh
kubectl logs -f -n kube-system -l name=autoscaler-agent | jq --unbuffered -r '"\(.ts | strftime("%Y-%m-%d %T")) \(.level) \(.logger): \(.msg) err=\(.error) stack=\(.stack)"'
```

Log into the VM and run:

```sh
cgexec -g memory:neon-test allocate-loop 800 850
```

The VM should immediately upscale, and then
Yeah; rollout was gradual, via a per-region config flag, so even though the flag has been enabled for every region now (since last Wednesday), the old code has not yet been removed (partly because I wasn't sure that we wouldn't find issues and subsequently need to revert away from on-disk). The general plan in that regard is that removing support for tmpfs-backed file cache here will come after removing the config flag in the control plane, just out of an abundance of caution in case there are any misconfigurations.
Took a look at downscale; came away confused.
```diff
 let message = format!(
-    "set cgroup memory.high to {} MiB, of new max {} MiB",
-    bytes_to_mebibytes(new_cgroup_mem_high),
-    bytes_to_mebibytes(available_memory)
+    "set cgroup memory threshold from {} MiB to {} MiB, of new total {} MiB",
+    bytes_to_mebibytes(cgroup.threshold),
+    bytes_to_mebibytes(new_threshold),
+    bytes_to_mebibytes(usable_system_memory)
 );
+cgroup.threshold = new_threshold;
 info!("downscale: {message}");
 status.push(message);
```
Where do we actually reconfigure the cgroup? Do we even still do that?
After this PR, the only interaction we have with the cgroup is to fetch memory statistics. I guess saying "set cgroup memory threshold" is a bit misleading in that sense.
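For illustration, a hypothetical sketch of the distinction being discussed: after this PR, the threshold is vm-monitor's own bookkeeping, compared against polled memory.stat values, rather than a limit written into the cgroup.

```rust
// Hypothetical struct, not the PR's actual types.
struct CgroupState {
    threshold: u64, // bytes; upscale when polled usage crosses this
}

impl CgroupState {
    fn set_threshold(&mut self, new_threshold: u64) {
        // Note: no write to /sys/fs/cgroup/<...>/memory.high here;
        // only vm-monitor's internal state changes.
        self.threshold = new_threshold;
    }

    fn should_upscale(&self, avg_usage: u64) -> bool {
        avg_usage >= self.threshold
    }
}
```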
Synced offline, good to go.