
compute_ctl: Streamline and Pipeline startup SQL #9717

Merged · 11 commits into main on Nov 20, 2024

Conversation

@MMeent (Contributor) commented Nov 11, 2024

Previously, compute_ctl had no good registry of which command would run when, relying exclusively on synchronous code to apply changes. When users have many databases/roles to manage, this step can take a substantial amount of time, breaking assumptions about low (re)start times in other systems.

This commit reduces the time compute_ctl takes to restart when changes must be applied, by making all commands more or less blind writes and applying them in an asynchronous context, waiting for completion only once we know all the commands have been sent.

Additionally, this reduces time spent by batching per-database operations, where previously we created a new SQL connection for every user-database operation we planned to execute.

Problem

Performance of starting compute with 100s of users and 100s of databases is quite suboptimal. This is one way to reduce the pain.

Summary of changes

  • Split "what to do" and "do it" for compute_ctl's spec apply with SQL commands
  • Rework compute_ctl to use async to apply the SQL changes built in the above system
  • Rework compute_ctl even further to batch all operations done in each database, so that we don't needlessly (re)connect to the same database over and over when a large number of roles is deleted (a rough sketch follows this list)
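
For illustration, a minimal sketch of that split, with hypothetical types and names (not the actual compute_ctl code), assuming tokio and tokio_postgres:

```rust
use std::collections::HashMap;

/// The "what to do" half: all statements destined for one database are
/// batched together, so each database is connected to at most once.
struct Plan {
    per_db: HashMap<String, Vec<String>>,
}

/// The "do it" half: apply each database's batch concurrently, waiting
/// for completion only after all commands have been sent.
async fn apply(plan: Plan) -> anyhow::Result<()> {
    let mut tasks = tokio::task::JoinSet::new();
    for (db, stmts) in plan.per_db {
        tasks.spawn(async move {
            let (client, conn) = tokio_postgres::connect(
                // Connection details are made up for the sketch.
                &format!("host=localhost user=cloud_admin dbname={db}"),
                tokio_postgres::NoTls,
            )
            .await?;
            tokio::spawn(conn);
            for sql in stmts {
                // More or less blind writes: fire the statement, read nothing back.
                client.simple_query(&sql).await?;
            }
            anyhow::Ok(())
        });
    }
    // Only now wait for everything we sent to complete.
    while let Some(res) = tasks.join_next().await {
        res??;
    }
    Ok(())
}
```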

Note for CP reviewer: I don't expect much to have changed on your side, as these are mostly data-flow changes in the tool itself, with no definition changes on the Compute Spec side.

Fixes https://github.com/neondatabase/cloud/issues/20461

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@MMeent MMeent requested review from a team as code owners November 11, 2024 16:35
@tristan957 (Member) left a comment


Looks really nice. Any benchmarks?

github-actions bot commented Nov 11, 2024

5499 tests run: 5273 passed, 0 failed, 226 skipped (full report)


Flaky tests (3): on Postgres 15 and Postgres 14

Code coverage* (full report)

  • functions: 31.4% (7931 of 25267 functions)
  • lines: 49.3% (62945 of 127623 lines)

* collected from Rust tests only


This comment is automatically updated with the latest test results.
Latest: 32ea5b3 at 2024-11-19T12:09:59.228Z

@clipperhouse commented
Is there a control plane aspect here? Not obvious to me if so.

@tristan957 (Member) commented

Is there a control plane aspect here? Not obvious to me if so.

Not that I'm aware of. I'm betting you got the ping due to the compute spec reading code.

@clipperhouse clipperhouse removed their request for review November 11, 2024 20:00
@knizhnik (Contributor) left a comment

I have a general comment, not directly related to this PR, which should perhaps be addressed separately. Right now we apply all these statements concurrently with user statements. In other words, at the moment these statements execute, the database already accepts user connections and so can execute user queries.

This may lead to unexpected and confusing behaviour: for example, a user may observe an old version of the Neon extension because it has not yet been upgraded. I actually reproduced such a problem myself in one of the tests.

So I wonder if we could (should?) somehow prevent users from accessing this database instance before compute_ctl completes the node update. I'm not sure what the best way to do it is:

  • lock some catalog table
  • configure a different port
  • somehow block connections at the proxy level
    ...

@MMeent (Contributor, Author) commented Nov 12, 2024

Right now we apply all these statements concurrently with user statements. In other words, at the moment these statements execute, the database already accepts user connections and so can execute user queries.

Don't we only start allowing connections from outside the container after we've transitioned to the Running state? During startup we're the sole user of the database, right? At least, that's what's been diagnosed as causing the start_compute timeouts when the spec has many databases and roles...

For apply_config, you are correct, but as that's always on a Running endpoint I don't think we should block user connections during that phase.

For example, a user may observe an old version of the Neon extension because it has not yet been upgraded. I actually reproduced such a problem myself in one of the tests.

I think that's more related to neon_local's (default) lack of applying config changes during startup: endpoint.json's skip_pg_catalog_updates field defaults to true, so the code being modified here never runs.

The previous iteration always connected to every database, which slowed down
startup when most databases had no operations scheduled.
Addresses @tristan957's comment, and allows separate review of these queries.
Instead of a Mutex, using an RwLock on the shared modifiable state helps speed
up read-only parallel steps (i.e. most parallel steps). This makes the
per-database jobs actually work in parallel, rather than serializing on
access to the database and role tables.

Additionally, this patch reduces the number of parallel workers to 1/3 of
max_connections, down from max_connections - 10. With the previous setting I
consistently hit "ERROR: no unpinned buffers available"; now
it works like a charm.
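
Roughly the shared-state pattern described above, as a minimal sketch (illustrative names, not the actual compute_ctl types): tokio's RwLock lets any number of per-database jobs read the role/database state at once, while writers still take it exclusively.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Default)]
struct SpecState {
    roles: HashMap<String, String>,     // role name -> options
    databases: HashMap<String, String>, // database name -> owner
}

async fn per_database_job(state: Arc<RwLock<SpecState>>, db: String) {
    // A read guard can be held by many jobs at once, so the read-only
    // steps (i.e. most parallel steps) no longer serialize the way they
    // would behind a Mutex.
    let guard = state.read().await;
    let _owner = guard.databases.get(&db);
    // ... build and send the SQL for this database ...
}

async fn remove_role(state: Arc<RwLock<SpecState>>, role: &str) {
    // Writers briefly take the lock exclusively.
    state.write().await.roles.remove(role);
}
```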
@MMeent (Contributor, Author) commented Nov 15, 2024

Performance review

No-op sync of 1024 roles and 1024 databases, on my WSL box (16 vCPU, plenty of RAM/SSD):

  • main @ 4534f5c: took 5m30s after the first connection to a user database (5m38s incl. all overheads), at ~10-20% max across all cores.
  • This PR @ c20e452: took 15s after the first connection to a user database (21s incl. all overheads), at 100% across all cores.

Config

  1. postgresql.conf updated with these settings:

    shared_buffers=128MB
    neon.max_file_cache_size=128MB
    neon.file_cache_size_limit=128MB
    neon.file_cache_path='/tmp/pg_cache'
    log_connections=on
    log_statement=all
    
  2. I updated the code of neon_cloud locally to generate num_dbs bogus databases and num_roles bogus roles.

First I let the system apply the changes, then shut the database down. Finally, I measured the start time with nothing left to be done.

Future work

Many of the modifying queries generated for the system database (e.g. role and database creation and deletion) can be parallelized within their respective phases, too. Maybe we can look into that at a later date, as creating 1k databases sequentially takes a lot of time (and is thus likely to hit start_compute timeouts). A rough sketch of the idea follows.
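
A minimal sketch of what that parallelization could look like (hypothetical names, not the actual compute_ctl code). It uses role creation: Postgres rejects concurrent CREATE DATABASE copies of the same template, so parallel database creation would additionally need retries or distinct templates.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn create_roles_concurrently(names: Vec<String>) -> anyhow::Result<()> {
    // Cap concurrency so we stay well below max_connections.
    let permits = Arc::new(Semaphore::new(8)); // illustrative cap
    let mut tasks = tokio::task::JoinSet::new();
    for name in names {
        let permits = permits.clone();
        tasks.spawn(async move {
            let _permit = permits.acquire_owned().await?;
            let (client, conn) = tokio_postgres::connect(
                // Connection details are made up for the sketch.
                "host=localhost user=cloud_admin dbname=postgres",
                tokio_postgres::NoTls,
            )
            .await?;
            tokio::spawn(conn);
            // Identifier escaping is elided for brevity.
            client
                .simple_query(&format!("CREATE ROLE \"{name}\" LOGIN"))
                .await?;
            anyhow::Ok(())
        });
    }
    while let Some(res) = tasks.join_next().await {
        res??;
    }
    Ok(())
}
```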

@MMeent (Contributor, Author) commented Nov 18, 2024

@clipperhouse could you check this PR and approve it if/when you determine this doesn't produce any issues for Control Plane?

@clipperhouse commented
@clipperhouse could you check this PR and approve it if/when you determine this doesn't produce any issues for Control Plane?

Any API contract changes here, from the perspective of control plane as caller or callee? My read is that we should mostly see faster responses, and possibly better concurrency?

cc’ing @mtyazici to have a look.

@mtyazici (Contributor) commented Nov 18, 2024

Any API contract changes here, from the perspective of control plane as caller or callee? My read is that we should mostly see faster responses, and possibly better concurrency?

cc’ing @mtyazici to have a look.

AFAIS, this PR doesn't change any behavior with cplane -> compute communication. So, I don't think it needs our review. Please let us know if otherwise cc @MMeent

@MMeent (Contributor, Author) commented Nov 18, 2024

AFAIS, this PR doesn't change any behavior with cplane -> compute communication. So, I don't think it needs our review. Please let us know if otherwise

The files touched involve CPlane API functionality, and are thus marked by CodeOwners as requiring team-CPlane's review before the PR can be merged.

@MMeent MMeent requested a review from tristan957 November 19, 2024 12:56
@MMeent MMeent enabled auto-merge (squash) November 19, 2024 17:08
@MMeent MMeent merged commit ea1858e into main Nov 20, 2024
78 checks passed
@MMeent MMeent deleted the perf/compute_ctl-async-sql branch November 20, 2024 01:14
github-merge-queue bot pushed a commit that referenced this pull request Nov 28, 2024
## Problem

We used `set_path()` to replace the database name in the connection
string. It automatically does URL-safe encoding if the path is not
already encoded, but it does so as per the URL standard, which assumes
that tabs can be safely removed from the path without changing the
meaning of the URL. See, e.g.,
https://url.spec.whatwg.org/#concept-basic-url-parser. It also breaks
for DBs with properly %-encoded names, like with `%20`: those are kept
intact, but should actually be escaped again.

Yet, this is not true for Postgres, where it's completely valid to have
trailing tabs in the database name.

I think this is the PR that caused this regression
#9717, as it switched from
`postgres::config::Config` back to `set_path()`.

This was fixed a while ago already [1], btw; I just hadn't added a test
to catch this regression back then :(

## Summary of changes

This commit changes the code back to use
`postgres/tokio_postgres::Config` everywhere.

While at it, also do some changes around this, as I had to touch the code:
1. Bump some logging from `debug` to `info` in the spec apply path. We
do not use `debug` in prod, and it was tricky to understand what was
going on with this bug in prod.
2. Refactor the configuration concurrency calculation code so it is
reusable. Yet, still keep `1` in the case of reconfiguration. The
database can be actively used at this moment, so we cannot guarantee
that there will be enough spare connection slots, and the underlying
code won't handle connection errors properly.
3. Simplify the installed extensions code. It was spawning a blocking
task inside an async function, which doesn't make much sense. Instead, just
have a main sync function and call it with `spawn_blocking` in the API
code -- the only place we need it to be async.
4. Add a regression Python test to cover this and related problems in the
future. Also, add more extensive testing of the schema dump and the DBs and
roles listing APIs.

[1]:
4d1e48f
[2]:
https://www.postgresql.org/message-id/flat/20151023003445.931.91267%40wrigleys.postgresql.org

Resolves neondatabase/cloud#20869
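
For illustration, a minimal sketch of the difference (host/user values are made up; the `url` crate call stands in for the old `set_path()` behavior, and `tokio_postgres::Config` is what this commit switches back to):

```rust
use url::Url;

fn main() {
    // Editing the database name via the URL path goes through WHATWG URL
    // parsing which, as described above, strips tabs and leaves
    // already-%-encoded names unescaped.
    let mut u = Url::parse("postgresql://cloud_admin@localhost:5432/postgres").unwrap();
    u.set_path("db\twith\ttabs"); // tabs are valid in a Postgres database name

    // Setting the name on the driver's Config bypasses URL normalization
    // entirely, so the name reaches Postgres byte-for-byte.
    let mut cfg = tokio_postgres::Config::new();
    cfg.host("localhost")
        .port(5432)
        .user("cloud_admin")
        .dbname("db\twith\ttabs");
}
```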
awarus pushed a commit that referenced this pull request Dec 5, 2024