compute_ctl apply_config sets GUC max_stack_depth which terminates parallel worker (e.g. create index parallel worker) #10184

Bodobolero · 2024-12-18T13:04:16Z

Steps to reproduce

Run a long-running maintenance task (such as creating a btree index) with parallel workers.
Sporadically these tasks fail during an apply_config operation with the error message:

ERROR: parameter "max_stack_depth" cannot be set during a parallel operation

apply_config uses pg_ctl reload -D to send SIGHUP to postgres and somehow this causes max_stack_depth to be set.

And indeed it seems postgres ALWAYS sets max_stack_depth in
https://github.com/neondatabase/postgres/blob/97f9fde349c6de6d573f5ce96db07eca60ce6185/src/backend/utils/misc/guc.c#L1585
whenever `stack_rlimit > 0` and `new_limit > 100`,
when it receives a SIGHUP.
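
To make the condition above concrete, here is a minimal standalone sketch of how the derived value appears to be computed. It is not the PostgreSQL source: `derive_max_stack_depth_kb` is a hypothetical helper, and the 512 kB safety margin and the 2048 kB cap are assumptions based on upstream guc.c that may differ between versions.

```c
/* Assumption: safety margin subtracted from the OS stack limit (bytes). */
#define STACK_DEPTH_SLOP_BYTES (512 * 1024L)
/* Assumption: upper bound applied to the derived setting (kB). */
#define STACK_DEPTH_CAP_KB 2048L

/*
 * Hypothetical helper: given the stack rlimit in bytes, return the
 * max_stack_depth value (in kB) that would be set, or -1 if the two
 * conditions (stack_rlimit > 0 and new_limit > 100) are not both met
 * and the setting would be left alone.
 */
long derive_max_stack_depth_kb(long stack_rlimit_bytes)
{
    long new_limit;

    if (stack_rlimit_bytes <= 0)
        return -1;              /* rlimit unknown: nothing is set */

    new_limit = (stack_rlimit_bytes - STACK_DEPTH_SLOP_BYTES) / 1024L;
    if (new_limit <= 100)
        return -1;              /* too small to be worth overriding */

    if (new_limit > STACK_DEPTH_CAP_KB)
        new_limit = STACK_DEPTH_CAP_KB;

    return new_limit;
}
```

Under these assumptions, a common 8 MiB stack limit always yields the capped value, so every reload would try to set the same number again, which is exactly the redundant assignment that parallel operations reject.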

I would consider this an upstream bug, because it means that a pg_ctl reload command can terminate parallel operations.
For Neon this is extremely bad, as we raise SIGHUP in every apply_config operation, which is quite frequent.

Expected result

Long-running tasks complete as in vanilla postgres (where normally the "system" doesn't send frequent SIGHUP to postgres)

Actual result

Sporadic failure

Environment

Staging, for example endpoint `ep-summer-darkness-w2ldx7r7.us-east-2.aws.neon.build/neondb`

Logs, links

Example of failing statement

CREATE UNIQUE INDEX events_pkey ON public.events USING btree (account_id, recorded_at, id);

this is on table (96 GiB):

ludicrous=> \d events
                                 Table "public.events"
    Column     |            Type             | Collation | Nullable |      Default      
---------------+-----------------------------+-----------+----------+-------------------
 account_id    | bigint                      |           | not null | 
 recorded_at   | timestamp without time zone |           | not null | 
 id            | uuid                        |           | not null | gen_random_uuid()
 user_id       | bigint                      |           |          | 
 device_id     | bigint                      |           |          | 
 event_name_id | bigint                      |           | not null | 
 properties    | jsonb                       |           | not null | '{}'::jsonb

when using
-c maintenance_work_mem=8388608 -c max_parallel_maintenance_workers=7

see https://github.com/neondatabase/neon/actions/runs/12293061677/job/34304987910
and discussion here https://neondb.slack.com/archives/C04DGM6SMTM/p1734513223148249?thread_ts=1733997259.898819&cid=C04DGM6SMTM


ololobus · 2024-12-18T14:48:09Z

> For Neon this is extremely bad as we raise SIGHUP in every apply_config operation which is quite frequent

apply_config operations are rather rare. They are performed by cplane and usually triggered by i) user actions; ii) shard re-balancing; iii) storage deploys. BUT you are right that it's especially bad for Neon, just for a different reason: the autoscaling agent sends SIGHUPs all the time to bigger computes, because it scales the LFC size up and down. And these operations could be much more frequent than `apply_config`s.

Bodobolero · 2024-12-18T15:41:16Z

> apply_config operations are rather rare. ... ii) shard re-balancing;

When we import a project or restore a pg_dump, the project grows gradually. Initially the project is not sharded. Once it reaches a certain size it gets sharded, and shard re-balancing takes place, causing an apply_config operation.
So I think that in Neon an apply_config is actually to be expected during pg_restore or project import whenever the project size exceeds the threshold at which we shard a project.
@ololobus

Given that import is an important business use case for Neon (we want to attract new customers), I consider this a high-priority issue. We shouldn't wait for an upstream fix, but should instead find a Neon-specific workaround or patch postgres to only update max_stack_depth if RLIMIT_STACK has actually changed from the prior setting.
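
The last idea (only touching max_stack_depth when the limit has actually changed) could look roughly like the sketch below. This is an illustration, not a patch: `last_applied_kb` and `should_apply_stack_depth` are hypothetical names, and where such a check would hook into guc.c is not shown here.

```c
#include <stdbool.h>

/* Hypothetical per-process record of the value applied most recently (kB). */
long last_applied_kb = -1;

/*
 * Hypothetical guard: return true only when the newly derived limit
 * differs from the one applied before, so that a reload with an
 * unchanged RLIMIT_STACK does not try to set the GUC again.
 */
bool should_apply_stack_depth(long new_limit_kb)
{
    if (new_limit_kb == last_applied_kb)
        return false;           /* unchanged: skip the redundant set */

    last_applied_kb = new_limit_kb;
    return true;
}
```

With a guard like this, the first derivation after startup would still apply the value, while later SIGHUPs with the same stack limit would become no-ops for this parameter.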

Bodobolero added labels on Dec 18, 2024: a/performance (Area: relates to performance of the system), c/compute (Component: compute, excluding postgres itself), c/control-plane (Component: Control Plane), c/PostgreSQL (Component: PostgreSQL features and bugs), t/bug (Issue Type: Bug)
