
Epic: Fix vm-monitor cgroup memory.high throttling #5444

Closed · 3 of 5 tasks
sharnoff opened this issue Oct 3, 2023 · 3 comments · Fixed by #5524

Labels: c/autoscaling/vm-monitor (Component: vm-monitor, the autoscaling component inside each VM) · t/bug (Issue Type: Bug) · t/Epic (Issue Type: Epic)

Comments

sharnoff (Member) commented Oct 3, 2023

Marked as an epic because this has been ongoing for a little while, and I expect it will take another ~1.5 weeks to completely resolve, due to the size of the initial PR and the time needed to review follow-ups.

Technical background

Neon's autoscaling feature relies on a handful of components managed by the autoscaling team. One of these is the vm-monitor (defined in libs/vm_monitor), which runs inside each VM. It can be run either as a standalone binary (for the autoscaling repo's CI) or embedded into an existing tokio runtime (as it is with compute_ctl).

One way we provide better guarantees around the speed of upscaling is by running postgres in its own cgroup and listening for "memory.high events" on that cgroup. This allows the vm-monitor to be notified ~immediately when a memory threshold is exceeded, so we can make timely upscaling requests without excessively polling.

These memory.high events are represented as increments to the high field of the cgroup's memory.events file, which in turn generates file modification events we can wait on. In order to produce these events, the cgroup must have its memory.high value set to some number of bytes; exceeding that value is what increments the counter.
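For illustration, here is a minimal sketch of the cgroup v2 files involved; the helper names are assumptions for this example, not the actual vm-monitor code:

```rust
use std::{fs, io, path::Path};

/// Read the `high` counter from a cgroup's memory.events file. On cgroup v2
/// the file contains lines like "low 0", "high 13", "max 0", "oom 0",
/// "oom_kill 0"; the `high` line counts how many times usage exceeded
/// memory.high.
fn high_event_count(cgroup: &Path) -> io::Result<u64> {
    let contents = fs::read_to_string(cgroup.join("memory.events"))?;
    Ok(contents
        .lines()
        .find_map(|line| line.strip_prefix("high "))
        .and_then(|n| n.trim().parse().ok())
        .unwrap_or(0))
}

/// Set memory.high to a byte value. Exceeding it causes the kernel to reclaim
/// and throttle, and increments the `high` counter read above.
fn set_memory_high(cgroup: &Path, bytes: u64) -> io::Result<()> {
    fs::write(cgroup.join("memory.high"), bytes.to_string())
}
```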

Beyond generating events, the primary purpose of memory.high is actually to provide a threshold above which the kernel starts reclaiming memory from the cgroup and throttling all of the processes in it.

It turns out that this throttling is actually independent of the memory reclamation and, if left unchecked for more than a couple seconds, quickly becomes quite severe (on the order of 1000x slower).

Historical context

At time of writing, the vm-monitor is actually a relatively recent creation, first used via neondatabase/autoscaling#362 (2023-07-03), and finally taking over from its predecessor with neondatabase/autoscaling#442 (2023-07-17).

Before the vm-monitor, we had the vm-informant (originally introduced 2023-01-10, with neondatabase/autoscaling#8), which performed the exact same role, but was written in Go, and developed inside the autoscaling repo.

The vm-informant had implemented a similar cgroup event-based upscaling mechanism, and we had lightly tested it, but actually enabling it required a small fix to how VMs were created by the control plane, and e2e tests on those changes always had opaque failures for reasons that were unclear at the time (see neondatabase/cloud#4557 and https://github.com/neondatabase/cloud/pull/5143#issuecomment-1575689390).

Nevertheless, the vm-informant's original implementation was found to heavily throttle the postgres cgroup if the VM wasn't upscaled (so severely that we originally thought the cgroup was frozen). This was due exclusively to the throttling by the kernel from memory.high itself, so we changed the vm-informant to gradually increase memory.high when we hit it, to avoid this throttling, in neondatabase/autoscaling#223 (2023-05-15). See also: the original issue.

The original implementation of the vm-monitor did not have this "gradually increase memory.high" logic, and after we first enabled the cgroup integration using the vm-monitor (#4920), we started seeing VMs stalling due to the same memory.high throttling that previously existed with the vm-informant:

Typical symptoms were that it took ~20s to connect to postgres with psql from inside the VM, and \d+ took longer than we cared to wait for (> 2 minutes).

We initially fixed this by reimplementing the vm-informant's updated logic (see #5303; every time we get a memory.high event, increase memory.high by a small amount), but it turned out this wasn't a complete fix, due to some remaining quirks: some already fixed, and some work still required (more detail below).
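As a rough illustration of that mitigation (the constant and function here are made-up names, not the vm-monitor's actual code), the policy boils down to:

```rust
/// Hypothetical bump size; the real value is a tuning decision.
const MEMORY_HIGH_BUMP_BYTES: u64 = 10 * 1024 * 1024; // e.g. 10 MiB per event

/// Given the previous and current values of the `high` counter from
/// memory.events, return a relaxed memory.high value to write (if any), so
/// the kernel stops throttling while an upscaling request happens in parallel.
fn next_memory_high(prev_events: u64, cur_events: u64, cur_high: u64) -> Option<u64> {
    if cur_events > prev_events {
        Some(cur_high + MEMORY_HIGH_BUMP_BYTES)
    } else {
        None
    }
}
```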

This issue exists to track the rest of the work required to fix any remaining instances of this throttling.

Planned changes

At a high level, there are two pieces of work remaining at the time of writing (check the task list below). There's also a secret third piece that's blocking the other two.

First, we need to fix the bug that allows memory.high to be decreased when the vm-monitor is notified of increases to the VM's memory size. Ideally this would be fairly straightforward, but there's some existing raciness between the logic handling upscaling/downscaling and the cgroup watcher's own "increase memory.high if we need to" that should be resolved first (or, alongside it) — in short, they both touch memory.high.

Second: we need to provide a mechanism outside downscaling to decrease the cgroup's memory.high value. Periodically decreasing memory.high when we can (i.e. when current memory usage is some threshold below the target value) should suffice here. The cgroup event watcher is already responsible for increasing memory.high when we hit it to avoid throttling, so it's kind of the ideal choice for managing this as well.

Therefore, before either of these, we need to move memory.high handling into the cgroup event watcher, with the end result being that:

  1. It's responsible for approving downscaling (or not); and
  2. Responding to an upscaling message blocks on the cgroup event watcher executing the change

In addition to fixing the existing raciness, that should make the rest of this much simpler to implement.
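
Roughly, the intended shape is that all memory.high changes funnel through the cgroup event watcher via message passing. A minimal sketch, with made-up type and field names (not the actual vm-monitor API):

```rust
use std::time::Duration;
use tokio::sync::{mpsc, oneshot};

/// Commands sent from the websocket-handling code to the watcher task, which
/// is the sole owner of memory.high (so the handlers can no longer race).
enum CgroupCommand {
    /// Upscaling happened: raise memory.high, replying only once the write is
    /// done so the upscale handler can block on it.
    SetMemoryHigh { bytes: u64, done: oneshot::Sender<()> },
    /// Ask the watcher whether downscaling to `target_bytes` is acceptable.
    RequestDownscale { target_bytes: u64, approved: oneshot::Sender<bool> },
}

async fn cgroup_watcher(mut commands: mpsc::Receiver<CgroupCommand>) {
    let mut memory_high: u64 = 0;
    // Periodically try to walk memory.high back down toward the desired value
    // if it was bumped to escape throttling but no upscaling happened.
    let mut relax = tokio::time::interval(Duration::from_secs(5));
    loop {
        tokio::select! {
            Some(cmd) = commands.recv() => match cmd {
                CgroupCommand::SetMemoryHigh { bytes, done } => {
                    memory_high = bytes;
                    // ... write `bytes` to the cgroup's memory.high file ...
                    let _ = done.send(());
                }
                CgroupCommand::RequestDownscale { target_bytes, approved } => {
                    // Approve only if current usage leaves enough headroom
                    // below `target_bytes` (check elided in this sketch).
                    let _ = approved.send(target_bytes <= memory_high);
                }
            },
            _ = relax.tick() => {
                // ... if usage is well below memory_high, lower memory.high ...
            }
        }
    }
}
```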

sharnoff added the t/bug, t/Epic, and c/autoscaling/vm-monitor labels Oct 3, 2023
sharnoff self-assigned this Oct 3, 2023
hlinnaka (Contributor) commented Oct 3, 2023

I think we need a bigger redesign of memory autoscaling.

It's never OK for the OOM killer to kill PostgreSQL. It's trivial for a SQL query to request an arbitrary amount of memory, and that must not lead to the whole server being restarted. The correct response is to get a graceful "ERROR: out of memory" in the offending backend, without affecting the rest of the system.

Currently, with a 0.25 CU endpoint:

neondb=> select repeat('x', 900000000);
SSL connection has been closed unexpectedly
The connection to the server was lost. Attempting reset: Succeeded.

That is not acceptable.

I believe with a cgroup memory limit, the only possible responses to reaching the memory limit are for the OOM killer to kill the process, or to pause the process. Neither of those is what we want. We need to find a different mechanism that does not use cgroups.

sharnoff (Member, Author) commented Oct 3, 2023

It's never OK for the OOM killer to kill PostgreSQL

[ ... ]

I believe with cgroup memory limit, the only possible responses to reaching the memory limit is for the OOM killer to kill the process, or to pause the process

I agree with you. I think what you're discussing is orthogonal to this issue: when we hit memory.high, there are no strict requirements about what we do. We previously had a cgroup-level memory limit (via memory.max), and that was recently removed in #5333.

hlinnaka (Contributor) commented Oct 3, 2023

It's never OK for the OOM killer to kill PostgreSQL
[ ... ]
I believe with cgroup memory limit, the only possible responses to reaching the memory limit is for the OOM killer to kill the process, or to pause the process

I agree with you. I think what you're discussing is orthogonal to this issue: When we hit memory.high, there's no strict requirements about what we do. We previously had a cgroup-level memory limit (via memory.max), and that was recently removed in #5333.

Hmm, so why is Postgres still getting killed when you run a query like that?

Oh, it's because we allow overcommit:

root@compute-snowy-breeze-24803104-pjghf:~# cat /proc/sys/vm/overcommit_memory 
0

We should disable that (echo 2 > /proc/sys/vm/overcommit_memory), but I agree that's a different problem, then.

sharnoff added a commit that referenced this issue Oct 6, 2023
The general idea of this PR is to move the on-downscale and on-upscale
cgroup handling logic into the CgroupWatcher itself via message passing
of commands, rather than directly acting on the cgroup from the thread
handling the websocket message.

This change is the large prerequisite to a handful of smaller changes
that should be much easier with this, all part of the Epic about fixing
memory.high throttling (#5444):

1. Fix a potential race condition wherein the logic that increases
   memory.high in response to memory.high events could overwrite newer
   (more permissive) values set by actual upscaling
      - **Handled by this change already!**
2. Fix a bug where, because memory.high has already been increased to
   avoid throttling, upscaling actually decreases memory.high and leads
   to unrecoverable throttling.
3. If memory.high has been increased to avoid throttling but no
   upscaling has happened, periodically try to decrease back to the
   desired memory.high.

For more general context, refer to #5444.
sharnoff added a commit that referenced this issue Oct 17, 2023
tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to instead periodically fetch memory
statistics from the cgroup's memory.stat and use that data to determine
if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
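
For context, here is a minimal sketch of the memory.stat-based approach this message describes; the field choice and threshold logic are illustrative assumptions, not the exact heuristic the PR implements:

```rust
use std::{collections::HashMap, fs, io, path::Path};

/// Parse memory.stat into a map of counter name -> bytes. The file is a list
/// of "<name> <value>" lines (e.g. "anon 123456", "file 789", ...).
fn read_memory_stat(cgroup: &Path) -> io::Result<HashMap<String, u64>> {
    let contents = fs::read_to_string(cgroup.join("memory.stat"))?;
    Ok(contents
        .lines()
        .filter_map(|line| {
            let (name, value) = line.split_once(' ')?;
            Some((name.to_string(), value.trim().parse().ok()?))
        })
        .collect())
}

/// Decide whether to request upscaling based on the polled statistics, e.g.
/// by looking at anonymous memory rather than raw usage (which would also
/// count tmpfs and page cache).
fn should_upscale(stat: &HashMap<String, u64>, threshold_bytes: u64) -> bool {
    stat.get("anon").copied().unwrap_or(0) >= threshold_bytes
}
```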
sharnoff added a commit that referenced this issue Oct 20, 2023