
neonvm postgres gets stuck with 0 TPS when running pgbench #5678

Closed
petuhovskiy opened this issue Oct 26, 2023 · 12 comments
Assignees
Labels
a/benchmark Area: related to benchmarking c/autoscaling Component: autoscaling (general category) c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@petuhovskiy
Member

Steps to reproduce

Run this script:

#! /usr/bin/env bash

set -euo pipefail

API_HOST=console.neon.tech
NEON_API_KEY=${NEON_API_KEY:-}

if [ -z "${NEON_API_KEY}" ]; then
    echo >&2 "Please set NEON_API_KEY env var"
    exit 1
fi

pg_version=15
autoscaling_limit_min_cu=0.25
autoscaling_limit_max_cu=2
provisioner=k8s-neonvm
region_id=aws-il-central-1

project=$(curl -s -q "https://${API_HOST}/api/v2/projects" \
                --header "Authorization: Bearer ${NEON_API_KEY}" \
                --json "{ \"project\": { \"name\": \"$region_id: PG$pg_version $autoscaling_limit_min_cu-$autoscaling_limit_max_cu\", \
                                         \"region_id\": \"$region_id\", \
                                         \"pg_version\": $pg_version, \
                                         \"provisioner\": \"$provisioner\", \
                                         \"autoscaling_limit_min_cu\": $autoscaling_limit_min_cu, \
                                         \"autoscaling_limit_max_cu\": $autoscaling_limit_max_cu }}")

project_id=$(echo "$project" | jq -r '.project.id')
branch_id=$(echo "$project" | jq -r '.branch.id')

connstr=$(echo "$project" | jq -r '.connection_uris[0].connection_uri')

pg_isready -d "$connstr" -t 10
sleep 1 # Just in case, wait for one extra second

ro_endpoint=$(curl -s https://console.neon.tech/api/v2/projects/$project_id/endpoints \
                --header "authorization: Bearer $NEON_API_KEY" \
                --json "{ \"endpoint\": { \"type\": \"read_only\", \
                                          \"branch_id\": \"$branch_id\", \
                                          \"provisioner\": \"$provisioner\", \
                                          \"autoscaling_limit_min_cu\": $autoscaling_limit_min_cu, \
                                          \"autoscaling_limit_max_cu\": $autoscaling_limit_max_cu  }}")

rw_host=$(echo "$project" | jq -r '.connection_uris[0].connection_parameters.host')
ro_host=$(echo "$ro_endpoint" | jq -r '.endpoint.host')
ro_connstr=$(echo "$connstr" | sed -e "s/$rw_host/$ro_host/")

TS=$(date +%s)
screen -L -Logfile "$HOME/screen-$TS.log" -S pgbench-run-$TS -d -m bash -c " \
        echo $connstr; \
        pgbench -i                    -s 684       ${connstr}; \
        echo; \
        pgbench -c 4 -b simple-update -T 3600 -P 5 --progress-timestamp ${connstr}; \
        echo; \
        pgbench -c 4 -b select-only   -T 3600 -P 5 --progress-timestamp ${connstr}; \
        echo; \
        echo $ro_connstr; \
        pgbench -c 4 -b select-only   -T 3600 -P 5 --progress-timestamp -n ${ro_connstr}; \
        "

Expected result

Script runs without issues.

Actual result

Sometimes it errors with `Timed out while waiting for WAL`, sometimes there's just no progress for a few seconds and the pgbench logs contain lines like `0.0 tps, lat 0.000 ms stddev 0.000, 0 failed`.

When I last looked at this issue, it seemed that WAL could get stuck in the compute for a few minutes without reaching the safekeepers.

How to debug

Run the script and attach to the compute node. Then poll the compute every second for the current walproposer status; this should give some information on why WAL is not being streamed to the safekeepers.
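A minimal polling loop along these lines could work (a sketch, not part of the original report; it assumes `psql` access to the compute via a connection string, and `poll_wal_lsn` is a name introduced here):

```shell
#!/usr/bin/env bash
# Sketch: poll the compute once per second and record the current WAL
# insert LSN plus a timestamp. If the LSN stops advancing while pgbench
# reports 0 TPS, WAL generation itself is stalled; if it keeps advancing,
# the problem is likely downstream of the compute.
poll_wal_lsn() {
  local connstr=$1 seconds=$2
  for _ in $(seq 1 "$seconds"); do
    psql "$connstr" -Atc "SELECT pg_current_wal_insert_lsn(), now();"
    sleep 1
  done
}
```

Comparing consecutive LSNs against the position the walproposer reports shows which half (WAL generation or WAL shipping) is stuck.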

Environment

prod, il-central-1, k8s-neonvm, postgres 15

Logs, links

https://neondb.slack.com/archives/C04KGFVUWUQ/p1697642632378239

@petuhovskiy petuhovskiy added t/bug Issue Type: Bug c/storage/safekeeper Component: storage: safekeeper labels Oct 26, 2023
@petuhovskiy petuhovskiy self-assigned this Oct 26, 2023

@petuhovskiy
Member Author

I reproduced this in il-central-1. The `0.0 tps` periods in the pgbench output correspond to backends waiting on `lfc_lock` in `pg_stat_activity`. Here's an example:

 datid | datname  | pid  | leader_pid | usesysid |   usename   |            application_name            | client_addr  | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event | state  | backend_xid | backend_xmin |       query_id       |                                     query                                     |  backend_type  
-------+----------+------+------------+----------+-------------+----------------------------------------+--------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+--------+-------------+--------------+----------------------+-------------------------------------------------------------------------------+----------------
 16386 | neondb   | 5561 |            |    16385 | arthur      | pgbench                                | 10.13.70.147 |                 |       47292 | 2023-10-30 18:42:09.305737+00 | 2023-10-30 18:43:29.47953+00  | 2023-10-30 18:43:29.481577+00 | 2023-10-30 18:43:29.481577+00 |                 |            | active |             |        29581 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + 857 WHERE aid = 57164082;   | client backend
       |          | 4761 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4758 |            |       10 | cloud_admin | pg_cron scheduler                      |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
       |          | 4759 |            |       10 | cloud_admin | TimescaleDB Background Worker Launcher |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
       |          | 4757 |            |          |             |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4765 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4775 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4930 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
     5 | postgres | 4929 |            |       10 | cloud_admin |                                        |              |                 |             |                               |                               |                               |                               |                 |            |        |             |              |                      | <insufficient privilege>                                                      | 
 16386 | neondb   | 5562 |            |    16385 | arthur      | pgbench                                | 10.13.61.145 |                 |       54284 | 2023-10-30 18:42:09.49305+00  | 2023-10-30 18:43:29.481181+00 | 2023-10-30 18:43:29.483586+00 | 2023-10-30 18:43:29.483586+00 | LWLock          | lfc_lock   | active |             |        29581 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + -3631 WHERE aid = 4262729;  | client backend
 16386 | neondb   | 5187 |            |    16385 | arthur      | psql                                   | 10.13.13.183 |                 |       58354 | 2023-10-30 18:40:52.923509+00 | 2023-10-30 18:43:36.224921+00 | 2023-10-30 18:43:36.224921+00 | 2023-10-30 18:43:36.224923+00 |                 |            | active |             |        29582 | -6815546958608512757 | SELECT * FROM pg_stat_activity;                                               | client backend
 16386 | neondb   | 5332 |            |    16385 | arthur      | psql                                   | 10.13.61.145 |                 |       43744 | 2023-10-30 18:41:16.728073+00 |                               | 2023-10-30 18:43:35.649612+00 | 2023-10-30 18:43:35.649826+00 | Client          | ClientRead | idle   |             |              | -3794202421633444576 | SELECT  pg_current_wal_insert_lsn(), now();                                   | client backend
 16386 | neondb   | 5563 |            |    16385 | arthur      | pgbench                                | 10.13.70.147 |                 |       47300 | 2023-10-30 18:42:09.694926+00 | 2023-10-30 18:43:29.485072+00 | 2023-10-30 18:43:29.485921+00 | 2023-10-30 18:43:29.485921+00 | LWLock          | lfc_lock   | active |             |        29582 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + -1416 WHERE aid = 14687375; | client backend
 16386 | neondb   | 5564 |            |    16385 | arthur      | pgbench                                | 10.13.13.183 |                 |       35762 | 2023-10-30 18:42:09.934542+00 | 2023-10-30 18:43:29.480972+00 | 2023-10-30 18:43:29.483411+00 | 2023-10-30 18:43:29.483411+00 | LWLock          | lfc_lock   | active |             |        29581 |  -702890799538953217 | UPDATE pgbench_accounts SET abalance = abalance + -4748 WHERE aid = 2116860;  | client backend
       |          | 4754 |            |          |             |                                        |              |                 |             |                               |                               |                               |    

This state can persist for more than a minute. I also checked `SELECT pg_current_wal_insert_lsn()`; it matches the LSN reported by the walproposer, so it doesn't look like a safekeeper/walproposer issue.
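Instead of scraping the full `SELECT * FROM pg_stat_activity` output, a narrowed summary query can be polled during the stall (a sketch; `wait_event_summary` is a name introduced here, the columns are standard `pg_stat_activity` columns):

```shell
# Sketch: count backends by what they are waiting on. During the stall we
# would expect rows with wait_event_type = 'LWLock', wait_event = 'lfc_lock'.
wait_event_summary() {
  psql "$1" -Atc "SELECT wait_event_type, wait_event, state, count(*)
                  FROM pg_stat_activity
                  GROUP BY 1, 2, 3
                  ORDER BY count(*) DESC;"
}
```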

@knizhnik
Contributor

Writes to the file are currently performed in the LFC under an exclusive lock.
I am going to change that in the next LFC fix (although the write operation is expected to be fast, because the LFC is small enough that it is mostly resident in memory).

So I think this problem is not caused by `lfc_lock`; rather, either cgroups or something else is blocking the process that holds `lfc_lock`. It will be very interesting to see a backtrace.

@sharnoff
Member

either cgroups or something else is blocking the process that holds `lfc_lock`.

Since #5524, we only use the cgroup for fetching memory statistics, so it shouldn't have an effect here.

@petuhovskiy petuhovskiy removed their assignment Nov 1, 2023
@petuhovskiy
Member Author

Status update: we reproduced this several times and looked at it together with @sharnoff. The issue looks VM/autoscaling-related, but the root cause is still unclear. Something happens to the postgres processes: they get stuck waiting on something.

The general case of this issue looks like this:

  • one postgres process is completely stuck; gdb cannot attach to it, and we don't know the exact reason it's blocked (it could be memory or disk access)
  • the other postgres processes are also stuck, but probably only because they depend on a common lock that the first process is holding; their state in pg_stat_activity is usually shown as lfc_lock or SyncRep, and the stacktraces we get with gdb usually end in futex or epoll

Here's an example of debug data at the moment of 0 TPS / everything-is-stuck: https://gist.github.com/petuhovskiy/2c30a18d716aba60c354fe4630fcb9c4

More details in the linked slack thread.
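The gdb-based backtrace collection described above can be scripted (a sketch; `dump_backtraces` is a name introduced here, and it assumes root access inside the VM with gdb installed):

```shell
# Sketch: dump a backtrace of every postgres process without an interactive
# gdb session. For the fully-stuck process, attaching may itself hang,
# hence the timeout around each invocation.
dump_backtraces() {
  for pid in $(pgrep -x postgres); do
    echo "=== pid $pid ==="
    timeout 10 gdb -p "$pid" -batch -ex 'bt' 2>&1
  done
}
```

A process whose attach times out is a candidate for the "completely stuck" first process; the rest should show the futex/epoll frames mentioned above.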

@petuhovskiy petuhovskiy changed the title WAL timeout with pgbench and neonvm neonvm postgres gets stuck with 0 TPS when running pgbench Nov 1, 2023
@petuhovskiy petuhovskiy added a/benchmark Area: related to benchmarking c/autoscaling/vm-monitor Component: vm-monitor (autoscaling component inside each VM) c/cloud/compute and removed c/storage/safekeeper Component: storage: safekeeper labels Nov 1, 2023
@sharnoff sharnoff added c/autoscaling Component: autoscaling (general category) and removed c/autoscaling/vm-monitor Component: vm-monitor (autoscaling component inside each VM) labels Nov 2, 2023
@sharnoff sharnoff self-assigned this Nov 2, 2023
@problame
Contributor

problame commented Nov 3, 2023

Quoting DM from @sharnoff

Basically, external symptoms are that a majority of the processes in the VM get "stuck", maybe triggered by attempting to hotplug memory, but not sure. If connected to the VM, we can observe:
  • Very little free memory: ~20MiB (the majority is buff/cache, then other memory usage)
  • cat /sys/fs/cgroup/memory.pressure shows 98% "full", 99% "some"
  • cat /sys/fs/cgroup/io.pressure shows very little IO pressure (maybe IO due to writeback doesn't count?)
OTOH, it seems like processes are not impacted if they aren't allocating (much) memory and aren't reading from disk - which, given the memory pressure stats, makes sense.

I can imagine the lfc_lock holder is getting stalled by the kernel in page reclamation.


Does this only reproduce in neonvm or also in k8s-pod?
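The pressure numbers quoted above come from the cgroup-v2 PSI files, which use the standard `some avg10=… avg60=…` / `full avg10=…` layout. A small helper makes one-per-second sampling easy (a sketch; `psi_avg10` is a name introduced here):

```shell
# Sketch: extract the avg10 percentage from the "some" or "full" line of a
# PSI file such as /sys/fs/cgroup/memory.pressure.
psi_avg10() {
  local kind=$1 file=$2
  awk -v kind="$kind" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$file"
}
```

For live observation, something like `watch -n1 'cat /sys/fs/cgroup/memory.pressure /sys/fs/cgroup/io.pressure'` gives the same picture.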

@vadim2404
Contributor

this week: @sharnoff to understand the root cause of the issue and to test the new patch from Konstantin with LFC caches

@knizhnik
Contributor

knizhnik commented Nov 6, 2023

I do not think this problem is related to `lfc_lock`, nor that my patch can actually prevent it.
I agree with @sharnoff that the blocking of processes in syscalls is most likely caused by very little free memory and a lack of free buffers.

That said, the LFC patch can really improve performance, so it would be nice to review and merge it.

@sharnoff
Member

sharnoff commented Nov 6, 2023

@knizhnik I don't think the lfc_lock issue is the cause, but I suspect it makes the impact of the issue worse.

My current best guess at what's happening is this:

  1. The VM gets to a point where very little memory is free (most of it is buffers)
  2. We try to add memory to the VM, but that first requires reserving some additional memory for the kernel, so buffers need to be evicted
  3. On top of the base rate of IO, the sudden spike in evictions pushes the rate of IO requests above what the device can satisfy, so IO starts to take much longer
    • I've seen this on my own machine when doing IO-intensive tasks with very little free memory
  4. Because lfc_lock is held during IO (which means the LFC can't do IO in parallel, if I understand correctly?), syscalls that would ordinarily complete very quickly get blocked behind this long IO queue, so operations that touch the LFC are disproportionately impacted

So, perhaps the VM isn't "frozen", but just very very slow.

But again, this is just my current hypothesis, so I may be incorrect or misunderstanding something.

@knizhnik
Contributor

knizhnik commented Nov 6, 2023

I suspect that without `lfc_lock` the situation might be even worse: many backends would perform IO operations in parallel, making it more likely that all buffer space is exhausted and the system gets stuck.

sharnoff added a commit to neondatabase/autoscaling that referenced this issue Nov 14, 2023
As part of debugging neondatabase/neon#5678, we want to test certain
kernel configurations. Currently, this requires rebuilding the kernel
and redeploying neonvm-controller with an updated neonvm-runner that
embeds the new kernel. That's too much effort.

So instead, we'd like to be able to use a custom kernel on a per-VM
basis. Because this *generally* should not be the common path, this PR
adds that as a new init container to the runner pod, using the same
"download via image" that we do for root disks.
sharnoff added a commit to neondatabase/autoscaling that referenced this issue Nov 16, 2023
@sharnoff
Member

Current state of this issue (duplicated from slack):
@problame and I were consistently reproducing in eu-west-1. When we tried using an updated kernel (6.5.11 via Ubuntu config because it was easier), the computes were running smoothly for much longer than before, and when they experienced 0 TPS, they both had issues at the exact same time, and it correlated with a storage deploy — so, probably a different issue.

We ran a bunch of these VMs with new kernels and found our k8s nodes got saturated on IO with only a few of the pgbench workloads — lots of periods of 0 TPS, but may or may not be related.

So, we tested the new kernels:

  • Bumped IOPS for a node in eu-west-1 from 10k → 100k and ran 5 VMs there for 4 hours:
    • Verdict: VMs had 1-6 individual seconds with 0 TPS — maximum duration of 0 TPS was 1 second
  • Ran 6 VMs in us-east-2, with forced pageservers to get guaranteed AZ distribution (3 nodes in different AZs; only 2 per node)
    • Verdict: VMs had 20-60 individual seconds with 0 TPS — maximum duration of 0 TPS was 5 seconds.
    • There were a couple periods where 2 VMs in different AZs had a few seconds of 0 TPS. Unclear what the cause was.

Next steps are, in order of decreasing priority:

  1. Checking that the improvement is because of new kernel, and not just the ubuntu config
  2. New autoscaling release with the fixed kernel
  3. Bisect kernel versions to figure out when it was fixed (to build expertise)
  4. Continue efforts to
    1. reduce IO usage (see Feature: Limit IOps for VMs autoscaling#576, neonvm/runner: treat guest-side disk flush requests as no-ops autoscaling#628)
    2. investigate remaining periods of 0 TPS

@sharnoff
Member

Marking this closed, as it was fixed with today's release. We can re-open if it turns out to persist.

@stepashka stepashka added the c/compute Component: compute, excluding postgres itself label Jun 21, 2024

6 participants