compute_ctl apply_config sets GUC max_stack_depth which terminates parallel worker (e.g. create index parallel worker) #10184

Open
Bodobolero opened this issue Dec 18, 2024 · 2 comments
Labels
a/performance Area: relates to performance of the system c/compute Component: compute, excluding postgres itself c/control-plane Component: Control Plane c/PostgreSQL Component: PostgreSQL features and bugs t/bug Issue Type: Bug

Comments

@Bodobolero
Contributor

Bodobolero commented Dec 18, 2024

Steps to reproduce

Run a long-running maintenance task (like creating a btree index) with parallel workers.
Sporadically, these tasks fail during an apply_config operation with the error message

ERROR: parameter "max_stack_depth" cannot be set during a parallel operation

apply_config uses pg_ctl reload -D to send SIGHUP to postgres, and somehow this causes max_stack_depth to be set.

And indeed it seems postgres ALWAYS sets max_stack_depth when it receives a SIGHUP, in
https://github.com/neondatabase/postgres/blob/97f9fde349c6de6d573f5ce96db07eca60ce6185/src/backend/utils/misc/guc.c#L1585
whenever both stack_rlimit > 0 and new_limit > 100 hold.
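
To make the condition above concrete, here is a rough Python model of how the value seems to be derived from the stack rlimit. This is only a sketch, not the PostgreSQL source: the function name is hypothetical, and the 512 KiB slop and the 2048 kB cap are assumptions written from memory of guc.c, so check them against the linked file.

```python
# Hypothetical model of the condition quoted above; not PostgreSQL source.
STACK_DEPTH_SLOP = 512 * 1024  # bytes; assumed safety margin
MAX_DERIVED_KB = 2048          # assumed upper cap on the derived value, in kB


def derived_max_stack_depth_kb(stack_rlimit_bytes):
    """Return the value a reload would try to set, or None if it would skip."""
    if stack_rlimit_bytes <= 0:
        return None
    new_limit = (stack_rlimit_bytes - STACK_DEPTH_SLOP) // 1024
    if new_limit <= 100:
        return None
    return min(new_limit, MAX_DERIVED_KB)


# The result depends only on the stack rlimit, so with an unchanged limit
# (8 MiB here) every reload derives the same value and sets it again.
reloads = [derived_max_stack_depth_kb(8 * 1024 * 1024) for _ in range(3)]
print(reloads)
```

Since the derived value does not change between reloads, the repeated assignment has no effect for a normal backend, but the attempt alone is enough to raise the error in a backend that is in a parallel operation.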

I would consider this an upstream bug, because it means that the pg_ctl reload command would terminate parallel operations.
For Neon this is extremely bad, as we raise SIGHUP in every apply_config operation, which is quite frequent.

Expected result

Long-running tasks complete as in vanilla postgres (where normally the "system" doesn't send frequent SIGHUP to postgres)

Actual result

Sporadic failure

Environment

Staging, for example endpoint `ep-summer-darkness-w2ldx7r7.us-east-2.aws.neon.build/neondb`

Logs, links

Example of failing statement

CREATE UNIQUE INDEX events_pkey ON public.events USING btree (account_id, recorded_at, id);

this is on table (96 GiB):

ludicrous=> \d events
                                 Table "public.events"
    Column     |            Type             | Collation | Nullable |      Default      
---------------+-----------------------------+-----------+----------+-------------------
 account_id    | bigint                      |           | not null | 
 recorded_at   | timestamp without time zone |           | not null | 
 id            | uuid                        |           | not null | gen_random_uuid()
 user_id       | bigint                      |           |          | 
 device_id     | bigint                      |           |          | 
 event_name_id | bigint                      |           | not null | 
 properties    | jsonb                       |           | not null | '{}'::jsonb

when using
-c maintenance_work_mem=8388608 -c max_parallel_maintenance_workers=7

see https://github.com/neondatabase/neon/actions/runs/12293061677/job/34304987910
and discussion here https://neondb.slack.com/archives/C04DGM6SMTM/p1734513223148249?thread_ts=1733997259.898819&cid=C04DGM6SMTM

@ololobus
Member

For Neon this is extremely bad as we raise SIGHUP in every apply_config operation which is quite frequent

apply_config operations are rather rare. They are performed by cplane and are usually triggered by i) user actions; ii) shard re-balancing; iii) storage deploys. BUT you are right that it's especially bad for Neon, just for a different reason: the autoscaling agent sends SIGHUPs all the time to bigger computes, because it scales the LFC size up and down. These operations can be much more frequent than `apply_config`s.

@Bodobolero
Contributor Author

Bodobolero commented Dec 18, 2024

apply_config operations are rather rare. ... ii) shard re-balancing;

When we import a project or restore a pg_dump, the project grows gradually. Initially the project will not be sharded. Once it reaches a certain size, it will be sharded, and shard re-balancing takes place, causing an apply_config operation.
So I think in Neon an apply_config is rather the expected thing to happen during pg_restore or project import for a project whose size exceeds the threshold at which we shard a project.
@ololobus

Given that import is an important business use case for Neon (we want to attract new customers), I consider this a high-priority issue. We shouldn't wait for an upstream fix, but maybe find a Neon-specific work-around or patch postgres to only update max_stack_depth if RLIMIT_STACK has really changed from the prior setting.
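
A minimal sketch of the "only update if RLIMIT_STACK changed" idea, written as a toy Python model rather than an actual postgres patch. The `Backend` class, its methods, and the `in_parallel_mode` flag are hypothetical stand-ins for illustration; only the error text is taken from the report above.

```python
class ParallelOperationError(Exception):
    """Stands in for the 'cannot be set during a parallel operation' error."""


class Backend:
    # Hypothetical stand-in for a postgres backend; not a PostgreSQL API.
    def __init__(self, stack_rlimit_bytes):
        self.last_stack_rlimit = stack_rlimit_bytes
        self.in_parallel_mode = False
        self.set_calls = 0

    def _set_max_stack_depth(self):
        # Any assignment attempt fails while a parallel operation is running.
        if self.in_parallel_mode:
            raise ParallelOperationError(
                'parameter "max_stack_depth" cannot be set '
                "during a parallel operation"
            )
        self.set_calls += 1

    def reload_unconditional(self, stack_rlimit_bytes):
        # Behaviour described in this issue: assign on every SIGHUP.
        self._set_max_stack_depth()
        self.last_stack_rlimit = stack_rlimit_bytes

    def reload_guarded(self, stack_rlimit_bytes):
        # Proposed guard: skip the assignment if the rlimit is unchanged.
        if stack_rlimit_bytes == self.last_stack_rlimit:
            return
        self._set_max_stack_depth()
        self.last_stack_rlimit = stack_rlimit_bytes


worker = Backend(8 * 1024 * 1024)
worker.in_parallel_mode = True

worker.reload_guarded(8 * 1024 * 1024)  # unchanged rlimit: nothing is set
try:
    worker.reload_unconditional(8 * 1024 * 1024)
except ParallelOperationError as err:
    print("unconditional reload failed:", err)
```

In this model the guarded reload would still fail if the rlimit really did change during a parallel operation, but that should be far rarer than a reload with an unchanged limit.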
