neonvm postgres gets stuck with 0 TPS when running pgbench #5678
Comments
Possibly related ticket: https://neondb.slack.com/archives/C03FHB4TASC/p1698561965608509
I reproduced this in il-central-1. What I found is that
This state can stay like this for more than a minute. I also checked
Writes to the file are now performed in LFC under an exclusive lock, so I think this problem is not caused by
Since #5524, we only use the cgroup for the purposes of fetching memory statistics; it shouldn't have an effect here.
Status update: we reproduced this several times and looked at it together with @sharnoff. This issue looks VM/autoscaling related, but the root cause is still not clear. Something happens to the postgres processes; they're stuck waiting for something. The general case of this issue looks like this:
Here's an example of debug data at the moment of 0 TPS / everything-is-stuck: https://gist.github.com/petuhovskiy/2c30a18d716aba60c354fe4630fcb9c4. More details are in the linked Slack thread.
Quoting a DM from @sharnoff:
I can imagine the lfc_lock holder is getting stalled by the kernel in page reclamation. Does this only reproduce in neonvm or also in k8s-pod?
This week: @sharnoff to understand the root cause of the issue and to test the new patch from Konstantin with LFC caches.
I do not think that this problem is related to lfc_lock, nor that my patch can actually prevent it. But this LFC patch can really improve performance, so it would be nice to review and merge it.
@knizhnik I don't think the lfc_lock issue is the cause, but I suspect it makes the impact of the issue worse. My current best guess at what's happening is this:
So, perhaps the VM isn't "frozen", but just very very slow. But again, this is just my current hypothesis, so I may be incorrect or misunderstanding something.
I suspect that if there is no lfc_lock, then the situation may be even worse: in this case many backends will perform parallel IO operations, and there are more chances that all buffer space is exhausted and the system gets stuck.
As part of debugging neondatabase/neon#5678, we want to test certain kernel configurations. Currently, this requires rebuilding the kernel and redeploying neonvm-controller with an updated neonvm-runner that embeds the new kernel. That's too much effort. So instead, we'd like to be able to use a custom kernel on a per-VM basis. Because this *generally* should not be the common path, this PR adds that as a new init container in the runner pod, using the same "download via image" approach that we use for root disks.
Current state of this issue (duplicated from Slack): we ran a bunch of these VMs with new kernels and found our k8s nodes got saturated on IO with only a few of the pgbench workloads; there were lots of periods of 0 TPS, but that may or may not be related. So, we tested the new kernels:
Next steps are, in order of decreasing priority:
Marking this closed, as it was fixed with today's release. We can re-open if it turns out to persist.
Steps to reproduce
Run this script:
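(The original script is not included above. Purely for illustration, here is a minimal sketch of this kind of pgbench workload, not the original script: it assumes `pgbench` is on PATH, that a hypothetical `DATABASE_URL` environment variable holds a connection string for the compute, and that the scale, client count, and duration values are placeholders rather than the ones actually used.)

```python
# Hypothetical reproduction sketch -- NOT the original script from this issue.
# Assumes pgbench is installed and DATABASE_URL points at the neonvm-backed
# compute; scale (-s), clients (-c), threads (-j), and duration (-T) are
# placeholders.
import os
import subprocess

database_url = os.environ["DATABASE_URL"]

# Initialize the pgbench tables, then run a sustained workload with per-second
# progress output, so stalls show up as "0.0 tps ... 0 failed" lines.
subprocess.run(["pgbench", "-i", "-s", "50", database_url], check=True)
subprocess.run(
    ["pgbench", "-c", "16", "-j", "4", "-T", "600", "--progress=1", database_url],
    check=True,
)
```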
Expected result
Script runs without issues.
Actual result
Sometimes it errors with `Timed out while waiting for WAL` errors, sometimes there's just no progress for a few seconds, and pgbench logs have `0.0 tps, lat 0.000 ms stddev 0.000, 0 failed` in them.

When I last looked at this issue, it seemed that WAL can get stuck in computes without reaching safekeepers for a few minutes.
How to debug
Run the script and attach to the compute node. Then poll the compute every second, getting the current walproposer status; this should give some info on why WAL is not streamed to safekeepers. A rough polling sketch is below.
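(The exact walproposer status query depends on the neon extension and isn't specified here, so this stand-in sketch polls standard WAL positions and backend wait events via `psql`; if the WAL LSNs stop advancing while backends accumulate wait events, that marks the stall window to dig into. It assumes `psql` can reach the compute through the usual `PGHOST`/`PGUSER`/`PGDATABASE` environment variables.)

```python
# Rough polling sketch (stand-in): polls generic WAL positions and wait events
# once per second instead of a walproposer-specific view.
import subprocess
import time

QUERY = """
SELECT now() AS ts,
       pg_current_wal_insert_lsn() AS insert_lsn,
       pg_current_wal_flush_lsn()  AS flush_lsn,
       (SELECT count(*) FROM pg_stat_activity
         WHERE wait_event IS NOT NULL) AS waiting_backends;
"""

while True:
    # -X: skip psqlrc; -A -t: unaligned, tuples-only output for compact logging
    subprocess.run(["psql", "-X", "-A", "-t", "-c", QUERY])
    time.sleep(1)
```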
Environment
prod, il-central-1, k8s-neonvm, postgres 15
Logs, links
https://neondb.slack.com/archives/C04KGFVUWUQ/p1697642632378239