
Epic: Fix vm-monitor cgroup memory.high throttling #5444

Closed · 3 of 5 tasks
sharnoff opened this issue Oct 3, 2023 · 3 comments · Fixed by #5524

Labels: c/autoscaling/vm-monitor (Component: vm-monitor, the autoscaling component inside each VM) · t/bug (Issue Type: Bug) · t/Epic (Issue Type: Epic)

Comments

sharnoff (Member) commented Oct 3, 2023

Marked as an epic because this has been ongoing for a little while, and I expect it will take another ~1.5 weeks to completely resolve, due to the size of the initial PR and the time needed to review follow-ups.

Technical background

Neon's autoscaling feature relies on a handful of components managed by the autoscaling team. One of these is the vm-monitor (defined in libs/vm_monitor), which runs inside each VM. It can be run either as a standalone binary (for the autoscaling repo's CI) or embedded into an existing tokio runtime (as it is with compute_ctl).

One way we provide better guarantees around the speed of upscaling is by running postgres in its own cgroup and listening for "memory.high events" on that cgroup. This allows the vm-monitor to be notified ~immediately when a memory threshold is exceeded, so we can make timely upscaling requests without excessively polling.

These memory.high events are represented as increments to the high field of the cgroup's memory.events file, which in turn generates file modification events we can wait on. In order to produce these events, the cgroup must have its memory.high value set to some number of bytes; exceeding that value is what increments the counter.
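For illustration, here is a minimal sketch of the cgroup v2 files involved; the helper names are assumptions for this example, not the actual vm-monitor code:

```rust
use std::{fs, io, path::Path};

/// Read the `high` counter from a cgroup's memory.events file. On cgroup v2
/// the file contains lines like "low 0", "high 13", "max 0", "oom 0",
/// "oom_kill 0"; the `high` line counts how many times usage exceeded
/// memory.high.
fn high_event_count(cgroup: &Path) -> io::Result<u64> {
    let contents = fs::read_to_string(cgroup.join("memory.events"))?;
    Ok(contents
        .lines()
        .find_map(|line| line.strip_prefix("high "))
        .and_then(|n| n.trim().parse().ok())
        .unwrap_or(0))
}

/// Set memory.high to a byte value. Exceeding it causes the kernel to reclaim
/// and throttle, and increments the `high` counter read above.
fn set_memory_high(cgroup: &Path, bytes: u64) -> io::Result<()> {
    fs::write(cgroup.join("memory.high"), bytes.to_string())
}
```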

Beyond generating events, the primary purpose of memory.high is actually to provide a threshold above which the kernel starts reclaiming memory from the cgroup and throttling all of the processes in it.

It turns out that this throttling is actually independent of the memory reclamation and, if left unchecked for more than a couple seconds, quickly becomes quite severe (on the order of 1000x slower).

Historical context

At time of writing, the vm-monitor is actually a relatively recent creation, first used via neondatabase/autoscaling#362 (2023-07-03), and finally taking over from its predecessor with neondatabase/autoscaling#442 (2023-07-17).

Before the vm-monitor, we had the vm-informant (originally introduced 2023-01-10, with neondatabase/autoscaling#8), which performed the exact same role, but was written in Go, and developed inside the autoscaling repo.

The vm-informant had implemented a similar cgroup event-based upscaling mechanism, and we had lightly tested it, but actually enabling it required a small fix to how VMs were created by the control plane, and e2e tests on those changes always had opaque failures for reasons that were unclear at the time (see neondatabase/cloud#4557 and https://github.com/neondatabase/cloud/pull/5143#issuecomment-1575689390).

Nevertheless, the vm-informant's original implementation was found to heavily throttle the postgres cgroup if the VM wasn't upscaled (so severely that we originally thought the cgroup was frozen). This was due exclusively to the throttling by the kernel from memory.high itself, so we changed the vm-informant to gradually increase memory.high when we hit it, to avoid this throttling, in neondatabase/autoscaling#223 (2023-05-15). See also: the original issue.

The original implementation of the vm-monitor did not have this "gradually increase memory.high" logic, and after we first enabled the cgroup integration using the vm-monitor (#4920), we started seeing VMs stalling due to the same memory.high throttling that previously existed with the vm-informant:

Typical symptoms were that it took ~20s to connect to postgres with psql from inside the VM, and \d+ took longer than we cared to wait for (> 2 minutes).

We initially fixed this by reimplementing the vm-informant's updated logic (see #5303; every time we get a memory.high event, increase memory.high by a small amount), but it turned out this wasn't a complete fix, due to some remaining quirks: some already fixed, and some work still required (more detail below).
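As a rough illustration of that mitigation (the constant and function here are made-up names, not the vm-monitor's actual code), the policy boils down to:

```rust
/// Hypothetical bump size; the real value is a tuning decision.
const MEMORY_HIGH_BUMP_BYTES: u64 = 10 * 1024 * 1024; // e.g. 10 MiB per event

/// Given the previous and current values of the `high` counter from
/// memory.events, return a relaxed memory.high value to write (if any), so
/// the kernel stops throttling while an upscaling request happens in parallel.
fn next_memory_high(prev_events: u64, cur_events: u64, cur_high: u64) -> Option<u64> {
    if cur_events > prev_events {
        Some(cur_high + MEMORY_HIGH_BUMP_BYTES)
    } else {
        None
    }
}
```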

This issue exists to track the rest of the work required to fix any remaining instances of this throttling.

Planned changes

At a high level, there are two pieces of work remaining at the time of writing (check the task list below). There's also a secret third piece that's blocking the other two.

First, we need to fix the bug that allows memory.high to be decreased when the vm-monitor is notified of increases to the VM's memory size. Ideally this would be fairly straightforward, but there's some existing raciness between the logic handling upscaling/downscaling and the cgroup watcher's own "increase memory.high if we need to" that should be resolved first (or, alongside it) — in short, they both touch memory.high.

Second: we need to provide a mechanism outside downscaling to decrease the cgroup's memory.high value. Periodically decreasing memory.high when we can (i.e. when current memory usage is some threshold below the target value) should suffice here. The cgroup event watcher is already responsible for increasing memory.high when we hit it to avoid throttling, so it's kind of the ideal choice for managing this as well.

Therefore, before either of these, we need to move memory.high handling into the cgroup event watcher, with the end result being that:

  1. It's responsible for approving downscaling (or not); and
  2. Responding to an upscaling message blocks on the cgroup event watcher executing the change

In addition to fixing the existing raciness, that should make the rest of this much simpler to implement.
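
Roughly, the intended shape is that all memory.high changes funnel through the cgroup event watcher via message passing. A minimal sketch, with made-up type and field names (not the actual vm-monitor API):

```rust
use std::time::Duration;
use tokio::sync::{mpsc, oneshot};

/// Commands sent from the websocket-handling code to the watcher task, which
/// is the sole owner of memory.high (so the handlers can no longer race).
enum CgroupCommand {
    /// Upscaling happened: raise memory.high, replying only once the write is
    /// done so the upscale handler can block on it.
    SetMemoryHigh { bytes: u64, done: oneshot::Sender<()> },
    /// Ask the watcher whether downscaling to `target_bytes` is acceptable.
    RequestDownscale { target_bytes: u64, approved: oneshot::Sender<bool> },
}

async fn cgroup_watcher(mut commands: mpsc::Receiver<CgroupCommand>) {
    let mut memory_high: u64 = 0;
    // Periodically try to walk memory.high back down toward the desired value
    // if it was bumped to escape throttling but no upscaling happened.
    let mut relax = tokio::time::interval(Duration::from_secs(5));
    loop {
        tokio::select! {
            Some(cmd) = commands.recv() => match cmd {
                CgroupCommand::SetMemoryHigh { bytes, done } => {
                    memory_high = bytes;
                    // ... write `bytes` to the cgroup's memory.high file ...
                    let _ = done.send(());
                }
                CgroupCommand::RequestDownscale { target_bytes, approved } => {
                    // Approve only if current usage leaves enough headroom
                    // below `target_bytes` (check elided in this sketch).
                    let _ = approved.send(target_bytes <= memory_high);
                }
            },
            _ = relax.tick() => {
                // ... if usage is well below memory_high, lower memory.high ...
            }
        }
    }
}
```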

sharnoff added the t/bug, t/Epic, and c/autoscaling/vm-monitor labels Oct 3, 2023
sharnoff self-assigned this Oct 3, 2023
hlinnaka (Contributor) commented Oct 3, 2023

I think we need a bigger redesign of memory autoscaling.

It's never OK for the OOM killer to kill PostgreSQL. It's trivial for a SQL query to request an arbitrary amount of memory, and that must not lead to the whole server being restarted. The correct response is to get a graceful "ERROR: out of memory" in the offending backend, without affecting the rest of the system.

Currently, with a 0.25 CU endpoint:

neondb=> select repeat('x', 900000000);
SSL connection has been closed unexpectedly
The connection to the server was lost. Attempting reset: Succeeded.

That is not acceptable.

I believe with a cgroup memory limit, the only possible responses to reaching the memory limit are for the OOM killer to kill the process, or to pause the process. Neither of those is what we want. We need to find a different mechanism that does not use cgroups.

sharnoff (Member, Author) commented Oct 3, 2023

It's never OK for the OOM killer to kill PostgreSQL

[ ... ]

I believe with cgroup memory limit, the only possible responses to reaching the memory limit is for the OOM killer to kill the process, or to pause the process

I agree with you. I think what you're discussing is orthogonal to this issue: when we hit memory.high, there are no strict requirements about what we do. We previously had a cgroup-level memory limit (via memory.max), and that was recently removed in #5333.

hlinnaka (Contributor) commented Oct 3, 2023

It's never OK for the OOM killer to kill PostgreSQL
[ ... ]
I believe with cgroup memory limit, the only possible responses to reaching the memory limit is for the OOM killer to kill the process, or to pause the process

I agree with you. I think what you're discussing is orthogonal to this issue: When we hit memory.high, there's no strict requirements about what we do. We previously had a cgroup-level memory limit (via memory.max), and that was recently removed in #5333.

Hmm, so why is Postgres still getting killed when you run a query like that?

Oh, it's because we allow overcommit:

root@compute-snowy-breeze-24803104-pjghf:~# cat /proc/sys/vm/overcommit_memory 
0

We should disable that (echo 2 > /proc/sys/vm/overcommit_memory), but I agree that's a different problem, then.

sharnoff added a commit that referenced this issue Oct 6, 2023
The general idea of this PR is to move the on-downscale and on-upscale
cgroup handling logic into the CgroupWatcher itself via message passing
of commands, rather than directly acting on the cgroup from the thread
handling the websocket message.

This change is the large prerequisite to a handful of smaller changes
that should be much easier with this, all part of the Epic about fixing
memory.high throttling (#5444):

1. Fix a potential race condition wherein the logic that increases
   memory.high in response to memory.high events could overwrite newer
   (more permissive) values set by actual upscaling
      - **Handled by this change already!**
2. Fix a bug where, because memory.high has already been increased to
   avoid throttling, upscaling actually decreases memory.high and leads
   to unrecoverable throttling.
3. If memory.high has been increased to avoid throttling but no
   upscaling has happened, periodically try to decrease back to the
   desired memory.high.

For more general context, refer to #5444.
sharnoff added a commit that referenced this issue Oct 17, 2023
tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to instead periodically fetch memory
statistics from the cgroup's memory.stat and use that data to determine
if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
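
For context, here is a minimal sketch of the memory.stat-based approach this message describes; the field choice and threshold logic are illustrative assumptions, not the exact heuristic the PR implements:

```rust
use std::{collections::HashMap, fs, io, path::Path};

/// Parse memory.stat into a map of counter name -> bytes. The file is a list
/// of "<name> <value>" lines (e.g. "anon 123456", "file 789", ...).
fn read_memory_stat(cgroup: &Path) -> io::Result<HashMap<String, u64>> {
    let contents = fs::read_to_string(cgroup.join("memory.stat"))?;
    Ok(contents
        .lines()
        .filter_map(|line| {
            let (name, value) = line.split_once(' ')?;
            Some((name.to_string(), value.trim().parse().ok()?))
        })
        .collect())
}

/// Decide whether to request upscaling based on the polled statistics, e.g.
/// by looking at anonymous memory rather than raw usage (which would also
/// count tmpfs and page cache).
fn should_upscale(stat: &HashMap<String, u64>, threshold_bytes: u64) -> bool {
    stat.get("anon").copied().unwrap_or(0) >= threshold_bytes
}
```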
sharnoff added a commit that referenced this issue Oct 20, 2023