
neonvm postgres gets stuck with 0 TPS when running pgbench #5678

Closed
petuhovskiy opened this issue Oct 26, 2023 · 12 comments
Assignees
Labels
a/benchmark Area: related to benchmarking c/autoscaling Component: autoscaling (general category) c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@petuhovskiy
Member

Steps to reproduce

Run this script:

#! /usr/bin/env bash

set -euo pipefail

API_HOST=console.neon.tech
NEON_API_KEY=${NEON_API_KEY:-}

if [ -z "${NEON_API_KEY}" ]; then
    echo >&2 "Please set NEON_API_KEY env var"
    exit 1
fi

pg_version=15
autoscaling_limit_min_cu=0.25
autoscaling_limit_max_cu=2
provisioner=k8s-neonvm
region_id=aws-il-central-1

project=$(curl -s -q "https://${API_HOST}/api/v2/projects" \
                --header "Authorization: Bearer ${NEON_API_KEY}" \
                --json "{ \"project\": { \"name\": \"$region_id: PG$pg_version $autoscaling_limit_min_cu-$autoscaling_limit_max_cu\", \
                                         \"region_id\": \"$region_id\", \
                                         \"pg_version\": $pg_version, \
                                         \"provisioner\": \"$provisioner\", \
                                         \"autoscaling_limit_min_cu\": $autoscaling_limit_min_cu, \
                                         \"autoscaling_limit_max_cu\": $autoscaling_limit_max_cu }}")

project_id=$(echo "$project" | jq -r '.project.id')
branch_id=$(echo "$project" | jq -r '.branch.id')

connstr=$(echo "$project" | jq -r '.connection_uris[0].connection_uri')

pg_isready -d "$connstr" -t 10
sleep 1 # Just in case, wait for one extra second

ro_endpoint=$(curl -s https://console.neon.tech/api/v2/projects/$project_id/endpoints \
                --header "authorization: Bearer $NEON_API_KEY" \
                --json "{ \"endpoint\": { \"type\": \"read_only\", \
                                          \"branch_id\": \"$branch_id\", \
                                          \"provisioner\": \"$provisioner\", \
                                          \"autoscaling_limit_min_cu\": $autoscaling_limit_min_cu, \
                                          \"autoscaling_limit_max_cu\": $autoscaling_limit_max_cu  }}")

rw_host=$(echo "$project" | jq -r '.connection_uris[0].connection_parameters.host')
ro_host=$(echo "$ro_endpoint" | jq -r '.endpoint.host')
ro_connstr=$(echo "$connstr" | sed -e "s/$rw_host/$ro_host/")

TS=$(date +%s)
screen -L -Logfile "$HOME/screen-$TS.log" -S pgbench-run-$TS -d -m bash -c " \
        echo $connstr; \
        pgbench -i                    -s 684       ${connstr}; \
        echo; \
        pgbench -c 4 -b simple-update -T 3600 -P 5 --progress-timestamp ${connstr}; \
        echo; \
        pgbench -c 4 -b select-only   -T 3600 -P 5 --progress-timestamp ${connstr}; \
        echo; \
        echo $ro_connstr; \
        pgbench -c 4 -b select-only   -T 3600 -P 5 --progress-timestamp -n ${ro_connstr}; \
        "

Expected result

Script runs without issues.

Actual result

Sometimes it errors with `Timed out while waiting for WAL`, sometimes there's just no progress for a few seconds and the pgbench logs contain lines like `0.0 tps, lat 0.000 ms stddev 0.000, 0 failed`.

When I last looked at this issue, it seemed that WAL could get stuck in the compute for a few minutes without reaching the safekeepers.

How to debug

Run the script and attach to the compute node. Then poll the compute every second for the current walproposer status; this should give some information on why WAL is not being streamed to the safekeepers.
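A minimal polling loop along these lines could work (a sketch, not part of the original report; it assumes `psql` access to the compute via a connection string, and `poll_wal_lsn` is a name introduced here):

```shell
#!/usr/bin/env bash
# Sketch: poll the compute once per second and record the current WAL
# insert LSN plus a timestamp. If the LSN stops advancing while pgbench
# reports 0 TPS, WAL generation itself is stalled; if it keeps advancing,
# the problem is likely downstream of the compute.
poll_wal_lsn() {
  local connstr=$1 seconds=$2
  for _ in $(seq 1 "$seconds"); do
    psql "$connstr" -Atc "SELECT pg_current_wal_insert_lsn(), now();"
    sleep 1
  done
}
```

Comparing consecutive LSNs against the position the walproposer reports shows which half (WAL generation or WAL shipping) is stuck.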

Environment

prod, il-central-1, k8s-neonvm, postgres 15

Logs, links

https://neondb.slack.com/archives/C04KGFVUWUQ/p1697642632378239

@petuhovskiy petuhovskiy added t/bug Issue Type: Bug c/storage/safekeeper Component: storage: safekeeper labels Oct 26, 2023
@petuhovskiy petuhovskiy self-assigned this Oct 26, 2023

@petuhovskiy
Member Author

I reproduced this in il-central-1. The `0.0 tps` periods in the pgbench output correspond to backends waiting on `lfc_lock` in `pg_stat_activity`. Here's an example:

 datid | datname  | pid  | leader_pid | usesysid |   usename   |            application_name            | client_addr  | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event | state  | backend_xid | backend_xmin |       query_id       |                                     query                                     |  backend_type  
-------+----------+------+------------+----------+-------------+----------------------------------------+--------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+--------+-------------+--------------+----------------------+-------------------------------------------------------------------------------+----------------
 16386 | neondb   | 5561 |            |    16385 | arthur      | pgbench                                | 10.13.70.147 |                 |       47292 | 2023-10-30 18:42:09.305737+00 | 2023-10-30 18:43:29.47953+00  | 2023-10-30 18:43:29.481577+00 | 2023-10-30 18:43:29.481577+00 |                 |            | active |             |        29581 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + 857 WHERE aid = 57164082;   | client backend
       |          | 4761 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4758 |            |       10 | cloud_admin | pg_cron scheduler                      |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
       |          | 4759 |            |       10 | cloud_admin | TimescaleDB Background Worker Launcher |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
       |          | 4757 |            |          |             |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4765 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4775 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4930 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4929 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
 16386 | neondb   | 5562 |            |    16385 | arthur      | pgbench                                | 10.13.61.145 |                 |       54284 | 2023-10-30 18:42:09.49305+00  | 2023-10-30 18:43:29.481181+00 | 2023-10-30 18:43:29.483586+00 | 2023-10-30 18:43:29.483586+00 | LWLock          | lfc_lock   | active |             |        29581 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + -3631 WHERE aid = 4262729;  | client backend
 16386 | neondb   | 5187 |            |    16385 | arthur      | psql                                   | 10.13.13.183 |                 |       58354 | 2023-10-30 18:40:52.923509+00 | 2023-10-30 18:43:36.224921+00 | 2023-10-30 18:43:36.224921+00 | 2023-10-30 18:43:36.224923+00 |                 |            | active |             |        29582 | -6815546958608512757 | SELECT * FROM pg_stat_activity;                                               | client backend
 16386 | neondb   | 5332 |            |    16385 | arthur      | psql                                   | 10.13.61.145 |                 |       43744 | 2023-10-30 18:41:16.728073+00 |                               | 2023-10-30 18:43:35.649612+00 | 2023-10-30 18:43:35.649826+00 | Client          | ClientRead | idle   |             |              | -3794202421633444576 | SELECT  pg_current_wal_insert_lsn(), now();                                   | client backend
 16386 | neondb   | 5563 |            |    16385 | arthur      | pgbench                                | 10.13.70.147 |                 |       47300 | 2023-10-30 18:42:09.694926+00 | 2023-10-30 18:43:29.485072+00 | 2023-10-30 18:43:29.485921+00 | 2023-10-30 18:43:29.485921+00 | LWLock          | lfc_lock   | active |             |        29582 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + -1416 WHERE aid = 14687375; | client backend
 16386 | neondb   | 5564 |            |    16385 | arthur      | pgbench                                | 10.13.13.183 |                 |       35762 | 2023-10-30 18:42:09.934542+00 | 2023-10-30 18:43:29.480972+00 | 2023-10-30 18:43:29.483411+00 | 2023-10-30 18:43:29.483411+00 | LWLock          | lfc_lock   | active |             |        29581 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + -4748 WHERE aid = 2116860;  | client backend
       |          | 4754 |            |          |             |                                        |              |                 |             |                               |                               |                               |    

This state can persist for more than a minute. I also checked `SELECT pg_current_wal_insert_lsn()`; it matches the LSN reported by the walproposer, so it doesn't look like a safekeeper/walproposer issue.
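Instead of scraping the full `SELECT * FROM pg_stat_activity` output, a narrowed summary query can be polled during the stall (a sketch; `wait_event_summary` is a name introduced here, the columns are standard `pg_stat_activity` columns):

```shell
# Sketch: count backends by what they are waiting on. During the stall we
# would expect rows with wait_event_type = 'LWLock', wait_event = 'lfc_lock'.
wait_event_summary() {
  psql "$1" -Atc "SELECT wait_event_type, wait_event, state, count(*)
                  FROM pg_stat_activity
                  GROUP BY 1, 2, 3
                  ORDER BY count(*) DESC;"
}
```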

@knizhnik
Contributor

Writes to the file are currently performed in the LFC under an exclusive lock.
I am going to change that in the next LFC fix (although the write operation is expected to be fast, because the LFC is small enough that it is mostly resident in memory).

So I think this problem is not caused by `lfc_lock`; rather, either cgroups or something else is blocking the process that holds `lfc_lock`. It will be very interesting to see a backtrace.

@sharnoff
Member

either cgroups or something else is blocking the process that holds `lfc_lock`.

Since #5524, we only use the cgroup for fetching memory statistics, so it shouldn't have an effect here.

@petuhovskiy petuhovskiy removed their assignment Nov 1, 2023
@petuhovskiy
Member Author

Status update: we reproduced this several times and looked at it together with @sharnoff. The issue looks VM/autoscaling-related, but the root cause is still unclear. Something happens to the postgres processes: they get stuck waiting on something.

The general case of this issue looks like this:

  • one postgres process is completely stuck; gdb cannot attach to it, and we don't know the exact reason it's blocked (it could be memory or disk access)
  • the other postgres processes are also stuck, but probably only because they depend on a common lock that the first process is holding; their state in pg_stat_activity is usually shown as lfc_lock or SyncRep, and the stacktraces we get with gdb usually end in futex or epoll

Here's an example of debug data at the moment of 0 TPS / everything-is-stuck: https://gist.github.com/petuhovskiy/2c30a18d716aba60c354fe4630fcb9c4

More details in the linked slack thread.
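The gdb-based backtrace collection described above can be scripted (a sketch; `dump_backtraces` is a name introduced here, and it assumes root access inside the VM with gdb installed):

```shell
# Sketch: dump a backtrace of every postgres process without an interactive
# gdb session. For the fully-stuck process, attaching may itself hang,
# hence the timeout around each invocation.
dump_backtraces() {
  for pid in $(pgrep -x postgres); do
    echo "=== pid $pid ==="
    timeout 10 gdb -p "$pid" -batch -ex 'bt' 2>&1
  done
}
```

A process whose attach times out is a candidate for the "completely stuck" first process; the rest should show the futex/epoll frames mentioned above.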

@petuhovskiy petuhovskiy changed the title WAL timeout with pgbench and neonvm neonvm postgres gets stuck with 0 TPS when running pgbench Nov 1, 2023
@petuhovskiy petuhovskiy added a/benchmark Area: related to benchmarking c/autoscaling/vm-monitor Component: vm-monitor (autoscaling component inside each VM) c/cloud/compute and removed c/storage/safekeeper Component: storage: safekeeper labels Nov 1, 2023
@sharnoff sharnoff added c/autoscaling Component: autoscaling (general category) and removed c/autoscaling/vm-monitor Component: vm-monitor (autoscaling component inside each VM) labels Nov 2, 2023
@sharnoff sharnoff self-assigned this Nov 2, 2023
@problame
Contributor

problame commented Nov 3, 2023

Quoting DM from @sharnoff

Basically, external symptoms are that a majority of the processes in the VM get "stuck", maybe triggered by attempting to hotplug memory, but not sure. If connected to the VM, we can observe:
  • Very little free memory: ~20MiB (the majority is buff/cache, then other memory usage)
  • cat /sys/fs/cgroup/memory.pressure shows 98% "full", 99% "some"
  • cat /sys/fs/cgroup/io.pressure shows very little IO pressure (maybe IO due to writeback doesn't count?)
OTOH, it seems like processes are not impacted if they aren't allocating (much) memory and aren't reading from disk - which, given the memory pressure stats, makes sense.

I can imagine the lfc_lock holder is getting stalled by the kernel in page reclamation.


Does this only reproduce in neonvm or also in k8s-pod?
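The pressure numbers quoted above come from the cgroup-v2 PSI files, which use the standard `some avg10=… avg60=…` / `full avg10=…` layout. A small helper makes one-per-second sampling easy (a sketch; `psi_avg10` is a name introduced here):

```shell
# Sketch: extract the avg10 percentage from the "some" or "full" line of a
# PSI file such as /sys/fs/cgroup/memory.pressure.
psi_avg10() {
  local kind=$1 file=$2
  awk -v kind="$kind" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$file"
}
```

For live observation, something like `watch -n1 'cat /sys/fs/cgroup/memory.pressure /sys/fs/cgroup/io.pressure'` gives the same picture.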

@vadim2404
Contributor

this week: @sharnoff to understand the root cause of the issue and to test the new patch from Konstantin with LFC caches

@knizhnik
Contributor

knizhnik commented Nov 6, 2023

I do not think this problem is related to `lfc_lock`, nor that my patch can actually prevent it.
I agree with @sharnoff that the blocking of processes in syscalls is most likely caused by very little free memory and a lack of free buffers.

That said, the LFC patch can really improve performance, so it would be nice to review and merge it.

@sharnoff
Member

sharnoff commented Nov 6, 2023

@knizhnik I don't think the lfc_lock issue is the cause, but I suspect it makes the impact of the issue worse.

My current best guess at what's happening is this:

  1. The VM gets to a point where very little memory is free (most of it is buffers)
  2. We try to add memory to the VM, but that first requires reserving some additional memory for the kernel, so buffers need to be evicted
  3. On top of the base rate of IO, the sudden spike in evictions pushes the rate of IO requests above what the device can satisfy, so IO starts to take much longer
    • I've seen this on my own machine when doing IO-intensive tasks with very little free memory
  4. Because lfc_lock is held during IO (which means the LFC can't do IO in parallel, if I understand correctly?), syscalls that would ordinarily complete very quickly get blocked behind this long IO queue, so operations that touch the LFC are disproportionately impacted

So, perhaps the VM isn't "frozen", but just very very slow.

But again, this is just my current hypothesis, so I may be incorrect or misunderstanding something.

@knizhnik
Contributor

knizhnik commented Nov 6, 2023

I suspect that without `lfc_lock` the situation might be even worse: many backends would perform IO operations in parallel, making it more likely that all buffer space is exhausted and the system gets stuck.

sharnoff added a commit to neondatabase/autoscaling that referenced this issue Nov 14, 2023
As part of debugging neondatabase/neon#5678, we want to test certain
kernel configurations. Currently, this requires rebuilding the kernel
and redeploying neonvm-controller with an updated neonvm-runner that
embeds the new kernel. That's too much effort.

So instead, we'd like to be able to use a custom kernel on a per-VM
basis. Because this *generally* should not be the common path, this PR
adds that as a new init container to the runner pod, using the same
"download via image" that we do for root disks.
sharnoff added a commit to neondatabase/autoscaling that referenced this issue Nov 16, 2023
@sharnoff
Member

Current state of this issue (duplicated from slack):
@problame and I were consistently reproducing in eu-west-1. When we tried using an updated kernel (6.5.11 via Ubuntu config because it was easier), the computes were running smoothly for much longer than before, and when they experienced 0 TPS, they both had issues at the exact same time, and it correlated with a storage deploy — so, probably a different issue.

We ran a bunch of these VMs with new kernels and found our k8s nodes got saturated on IO with only a few of the pgbench workloads — lots of periods of 0 TPS, but may or may not be related.

So, we tested the new kernels:

  • Bumped IOPS for a node in eu-west-1 from 10k → 100k and ran 5 VMs there for 4 hours:
    • Verdict: VMs had 1-6 individual seconds with 0 TPS — maximum duration of 0 TPS was 1 second
  • Ran 6 VMs in us-east-2, with forced pageservers to get guaranteed AZ distribution (3 nodes in different AZs; only 2 per node)
    • Verdict: VMs had 20-60 individual seconds with 0 TPS — maximum duration of 0 TPS was 5 seconds.
    • There were a couple periods where 2 VMs in different AZs had a few seconds of 0 TPS. Unclear what the cause was.

Next steps are, in order of decreasing priority:

  1. Checking that the improvement is because of new kernel, and not just the ubuntu config
  2. New autoscaling release with the fixed kernel
  3. Bisect kernel versions to figure out when it was fixed (to build expertise)
  4. Continue efforts to
    1. reduce IO usage (see Feature: Limit IOps for VMs autoscaling#576, neonvm/runner: treat guest-side disk flush requests as no-ops autoscaling#628)
    2. investigate remaining periods of 0 TPS

@sharnoff
Member

Marking this closed, as it was fixed with today's release. We can re-open if it turns out to persist.

@stepashka stepashka added the c/compute Component: compute, excluding postgres itself label Jun 21, 2024

6 participants