Epic: Fix vm-monitor cgroup memory.high throttling #5444
Comments
I think we need a bigger redesign of memory autoscaling. It's never OK for the OOM killer to kill PostgreSQL. It's trivial for a SQL query to request an arbitrary amount of memory, and that must not lead to the whole server being restarted. The correct response is to get a graceful "ERROR: out of memory" in the offending backend, without affecting the rest of the system. Currently, with a 0.25 CU endpoint:
That is not acceptable. I believe with a cgroup memory limit, the only possible responses to reaching the limit are for the OOM killer to kill the process, or to pause the process. Neither of those is what we want. We need to find a different mechanism that does not use cgroups.
I agree with you. I think what you're discussing is orthogonal to this issue: when we hit memory.high, there are no strict requirements about what we do. We previously had a cgroup-level memory limit (via memory.max), and that was recently removed in #5333.
Hmm, so why is Postgres still getting killed when you run a query like that? Oh, it's because we allow overcommit:
We should disable that.
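For reference, here is a minimal sketch of what disabling overcommit could look like if done programmatically; the path and value are standard Linux sysctls, but how and where Neon actually sets this is not specified here. With vm.overcommit_memory set to 2, the kernel does strict accounting, so an oversized allocation fails with ENOMEM in the requesting backend instead of succeeding and inviting the OOM killer later:

```rust
use std::fs;

/// Hypothetical helper: switch the kernel to strict overcommit accounting
/// (equivalent to `sysctl -w vm.overcommit_memory=2`). With this set, an
/// oversized allocation fails with ENOMEM in the requesting process instead
/// of succeeding and triggering the OOM killer later. Requires root.
fn disable_memory_overcommit() -> std::io::Result<()> {
    fs::write("/proc/sys/vm/overcommit_memory", "2")
}
```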
The general idea of this PR is to move the on-downscale and on-upscale cgroup handling logic into the CgroupWatcher itself via message passing of commands, rather than directly acting on the cgroup from the thread handling the websocket message. This change is the large prerequisite to a handful of smaller changes that should be much easier with this, all part of the Epic about fixing memory.high throttling (#5444):

1. Fix a potential race condition wherein the logic that increases memory.high in response to memory.high events could overwrite newer (more permissive) values set by actual upscaling - **Handled by this change already!**
2. Fix a bug where, because memory.high had already been increased to avoid throttling, upscaling actually decreases memory.high and leads to unrecoverable throttling.
3. If memory.high has been increased to avoid throttling but no upscaling has happened, periodically try to decrease back to the desired memory.high.

For more general context, refer to #5444.
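As a rough sketch of the shape of this change (the names and types below are illustrative, not the actual vm-monitor API), the websocket-message handlers send commands over a channel, and the watcher task becomes the only place that writes memory.high:

```rust
use std::fs;
use tokio::sync::mpsc;

/// Hypothetical command type, sent from the websocket-message handlers to the
/// cgroup watcher instead of those handlers touching the cgroup directly.
enum CgroupCommand {
    /// Upscaling happened: raise memory.high to this many bytes.
    Upscaled { memory_high_bytes: u64 },
    /// Downscale requested: lower memory.high to this many bytes.
    Downscale { memory_high_bytes: u64 },
}

/// Assumed helper: write a new memory.high value for the postgres cgroup.
fn write_memory_high(cgroup: &str, bytes: u64) -> std::io::Result<()> {
    fs::write(
        format!("/sys/fs/cgroup/{cgroup}/memory.high"),
        bytes.to_string(),
    )
}

/// Sketch of the watcher loop. Because it is the only task that writes
/// memory.high, its "raise memory.high on a memory.high event" logic can no
/// longer race with (or be overwritten by) the upscale/downscale handlers.
async fn watcher_loop(
    cgroup: String,
    initial_memory_high: u64,
    mut commands: mpsc::Receiver<CgroupCommand>,
    mut memory_high_events: mpsc::Receiver<()>,
) {
    const BUMP_BYTES: u64 = 64 * 1024 * 1024; // illustrative increment
    let mut current_high = initial_memory_high;
    loop {
        tokio::select! {
            Some(cmd) = commands.recv() => {
                current_high = match cmd {
                    CgroupCommand::Upscaled { memory_high_bytes }
                    | CgroupCommand::Downscale { memory_high_bytes } => memory_high_bytes,
                };
                let _ = write_memory_high(&cgroup, current_high);
            }
            Some(()) = memory_high_events.recv() => {
                // Bump memory.high slightly to avoid severe throttling while an
                // upscale request is (hopefully) in flight.
                current_high = current_high.saturating_add(BUMP_BYTES);
                let _ = write_memory_high(&cgroup, current_high);
            }
            else => break, // both channels closed
        }
    }
}
```

Serializing every memory.high write through a single task is what makes the race in item 1 disappear without any extra locking.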
tl;dr it's really hard to avoid throttling from memory.high, and it counts tmpfs & page cache usage, so it's also hard to make sense of. In the interest of fixing things quickly with something that should be *good enough*, this PR switches to instead periodically fetch memory statistics from the cgroup's memory.stat and use that data to determine if and when we should upscale. This PR fixes #5444, which has a lot more detail on the difficulties we've hit with memory.high. This PR also supersedes #5488.
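A rough sketch of what that polling could look like (the field name comes from the cgroup v2 memory.stat interface, but the specific fields, threshold, and polling interval the vm-monitor actually uses are assumptions here):

```rust
use std::{collections::HashMap, fs};

/// Parse the cgroup v2 memory.stat file into a map of counter name -> bytes.
fn read_memory_stat(cgroup: &str) -> std::io::Result<HashMap<String, u64>> {
    let raw = fs::read_to_string(format!("/sys/fs/cgroup/{cgroup}/memory.stat"))?;
    Ok(raw
        .lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(' ')?;
            Some((key.to_string(), value.parse::<u64>().ok()?))
        })
        .collect())
}

/// Illustrative upscale check: look only at anonymous memory (ignoring the
/// page cache and tmpfs usage that memory.high counts against us), and ask
/// for more memory once usage crosses a fraction of the current allocation.
/// The real policy and threshold are not specified here.
fn should_request_upscale(stat: &HashMap<String, u64>, total_memory_bytes: u64) -> bool {
    let anon = stat.get("anon").copied().unwrap_or(0);
    anon > total_memory_bytes / 4 * 3
}
```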
Marked as epic because this has been ongoing for a little while, and I expect it will take another ~1.5 weeks to completely resolve, due to the size of the initial PR and the time to review follow-ups.
Technical background
Neon's autoscaling feature relies on a handful of components managed by the autoscaling team. One of these is the vm-monitor (defined in `libs/vm_monitor`), which runs inside each VM. It can be provided either as a binary (for the autoscaling repo's CI) or embedded into an existing tokio runtime (as it is with `compute_ctl`).

One way we provide better guarantees around the speed of upscaling is by running postgres in its own cgroup and listening for "memory.high events" on that cgroup. This allows the vm-monitor to be notified ~immediately when a memory threshold is exceeded, so we can make timely upscaling requests without excessively polling.
These memory.high events are represented as changes to the `high` field of the cgroup's `memory.events` file, which generates the appropriate modification events, etc. In order to generate these events, the cgroup must have the `memory.high` value set to some number of bytes (and then, exceeding that value will generate events, etc.).

Beyond generating events, the primary purpose of memory.high is actually to provide a threshold above which the kernel should start reclaiming memory from processes in the cgroup and throttling all processes in the cgroup.
It turns out that this throttling is actually independent of the memory reclamation and, if left unchecked for more than a couple seconds, quickly becomes quite severe (on the order of 1000x slower).
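For concreteness, here is a minimal sketch of the two cgroup v2 files involved. The notification mechanism itself (watching `memory.events` for modifications, as described above) is omitted, and the helper names are not the vm-monitor's:

```rust
use std::fs;

/// Read the `high` counter from the cgroup's memory.events file: the number
/// of times usage has exceeded memory.high. An increase in this counter is
/// what gets surfaced to the vm-monitor as a "memory.high event".
fn read_high_event_count(cgroup: &str) -> std::io::Result<u64> {
    let raw = fs::read_to_string(format!("/sys/fs/cgroup/{cgroup}/memory.events"))?;
    Ok(raw
        .lines()
        .find_map(|line| line.strip_prefix("high "))
        .and_then(|count| count.trim().parse::<u64>().ok())
        .unwrap_or(0))
}

/// Set memory.high for the cgroup. Events are only generated once this holds
/// a byte value rather than the default "max"; it is also the threshold at
/// which the kernel begins reclaim and throttling.
fn set_memory_high(cgroup: &str, bytes: u64) -> std::io::Result<()> {
    fs::write(
        format!("/sys/fs/cgroup/{cgroup}/memory.high"),
        bytes.to_string(),
    )
}
```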
Historical context
At time of writing, the vm-monitor is actually a relatively recent creation, first used via neondatabase/autoscaling#362 (2023-07-03), and finally taking over from its predecessor with neondatabase/autoscaling#442 (2023-07-17).
Before the vm-monitor, we had the vm-informant (originally introduced 2023-01-10, with neondatabase/autoscaling#8), which performed the exact same role, but was written in Go, and developed inside the autoscaling repo.
The vm-informant had implemented a similar cgroup event-based upscaling mechanism, and we had lightly tested it, but actually enabling it required a small fix to how VMs were created by the control plane, and e2e tests on those changes always had opaque failures for reasons that were unclear at the time (see neondatabase/cloud#4557 and https://github.com/neondatabase/cloud/pull/5143#issuecomment-1575689390).
Nevertheless, the vm-informant's original implementation was found to heavily throttle the postgres cgroup if the VM wasn't upscaled (so severely that we originally thought the cgroup was frozen). This was due exclusively to the throttling by the kernel from memory.high itself, so we changed the vm-informant to gradually increase memory.high when we hit it, to avoid this throttling, in neondatabase/autoscaling#223 (2023-05-15). See also: the original issue.
The original implementation of the vm-monitor did not have this "gradually increase memory.high" logic, and after we finally first enabled the cgroup integration using the vm-monitor (#4920), we started seeing VMs stalling due to the same memory.high throttling that previously existed with the vm-informant:
Typical symptoms were that it took ~20s to connect to postgres with psql from inside the VM, and `\d+` took longer than we cared to wait for (> 2 minutes).

We initially fixed this by reimplementing the vm-informant's updated logic (see #5303; every time we get a memory.high event, increase memory.high by a small amount), but it turned out this wasn't a complete fix, due to some remaining quirks - some fixed, and some work still required (more detail below).
This issue exists to track the rest of the work required to fix any remaining instances of this throttling.
Planned changes
At a high level, there are two pieces of work remaining at time of writing (check the task list below). There's also a secret third piece that's blocking the other two.
First, we need to fix the bug that allows memory.high to be decreased when the vm-monitor is notified of increases to the VM's memory size. Ideally this would be fairly straightforward, but there's some existing raciness between the logic handling upscaling/downscaling and the cgroup watcher's own "increase memory.high if we need to" that should be resolved first (or, alongside it) — in short, they both touch memory.high.
Second, we need to provide a mechanism outside downscaling to decrease the cgroup's memory.high value. Periodically decreasing memory.high if we can (i.e., when current memory usage is some threshold below the target value) should suffice here; see the sketch at the end of this section. The cgroup event watcher is already responsible for increasing memory.high when we hit it to avoid throttling, so it's kind of the ideal choice for managing this as well.
Therefore, before either of these, we need to move memory.high handling into the cgroup event watcher, so that all changes to memory.high go through the watcher rather than being made directly by the threads handling upscale/downscale messages.
In addition to fixing the existing raciness, that should make the rest of this much simpler to implement.
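For the second piece, here is a sketch of what that periodic decrease might look like once the watcher owns memory.high (the headroom threshold, check interval, and function shape are all assumptions):

```rust
use std::fs;

/// Hypothetical periodic check, run from the cgroup watcher's loop every few
/// seconds: if memory.high was bumped above its desired value to escape
/// throttling, and current usage has since dropped comfortably below that
/// desired value, lower memory.high back down.
fn maybe_lower_memory_high(cgroup: &str, desired_high: u64, current_high: &mut u64) {
    const HEADROOM_BYTES: u64 = 128 * 1024 * 1024; // illustrative threshold

    if *current_high <= desired_high {
        return; // nothing to undo
    }
    let usage = fs::read_to_string(format!("/sys/fs/cgroup/{cgroup}/memory.current"))
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(u64::MAX); // on a failed read, err on the side of not lowering
    if usage.saturating_add(HEADROOM_BYTES) < desired_high {
        let write = fs::write(
            format!("/sys/fs/cgroup/{cgroup}/memory.high"),
            desired_high.to_string(),
        );
        if write.is_ok() {
            *current_high = desired_high;
        }
    }
}
```

Gating the decrease on actual usage (rather than decreasing unconditionally) keeps the watcher from reintroducing throttling immediately after backing off.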
Tasks