test_pgdata_import_smoke timeouts #10071

Closed
jcsp opened this issue Dec 10, 2024 · 3 comments · Fixed by #10114

jcsp commented Dec 10, 2024

https://neon-github-public-dev.s3.amazonaws.com/reports/main/12253951655/index.html#testresult/3359205196681dba/retries

```
test_runner/regress/test_import_pgdata.py:305: in test_pgdata_import_smoke
    validate_vanilla_equivalence(br_initdb_endpoint)
test_runner/regress/test_import_pgdata.py:114: in validate_vanilla_equivalence
    assert ep.safe_psql("select count(*), sum(data::bigint)::bigint from t") == [
test_runner/fixtures/neon_fixtures.py:283: in safe_psql
    return self.safe_psql_many([query], **kwargs)[0]
test_runner/fixtures/neon_fixtures.py:298: in safe_psql_many
    cur.execute(query)
/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/encodings/utf_8.py:15: in decode
    def decode(input, errors='strict'):
E   Failed: Timeout >300.0s
```
jcsp self-assigned this Dec 10, 2024

jcsp commented Dec 10, 2024

Typical runtimes are ~210s, against the 300s test timeout.

jcsp commented Dec 10, 2024

The timeout always occurs during the last validate_vanilla_equivalence call, on the br-initdb-ro endpoint.

We give it a basebackup successfully:

```
2024-12-10T10:46:37.657219Z  INFO page_service_conn_main{peer_addr=127.0.0.1:40632}:process_query{tenant_id=42884b4d19e567b90afaf354195a9e50 timeline_id=b094f8a43aa6d5e6a4d7f947a7dbcb0b}:handle_basebackup_request{lsn=None prev_lsn=None full_backup=false shard_id=0000}: basebackup complete lsn_await_millis=0 basebackup_millis=27
```

Endpoint logs don't say much:

```
PG:2024-12-10 10:46:37.880 GMT [142957] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:15060' with protocol version 2
2024-12-10T10:46:38.000833Z  INFO [NEON_EXT_STAT] {"extensions":[{"extname":"plpgsql","versions":["1.0"],"n_databases":2}]}
```

... time passes ...

```
PG:2024-12-10 10:47:37.812 GMT [154386] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:15060' with protocol version 2
PG:2024-12-10 10:47:50.205 GMT [142934] LOG:  received fast shutdown request
```

jcsp commented Dec 12, 2024

The test's runtime is dominated by a simple Postgres query that sums a large number of integers. The test intentionally bloats the data so that it occupies many pages (~300k), and the reads are not batched because tests run with effective_io_concurrency=1, so we end up CPU-bottlenecked on the pageserver, which makes this test rather sensitive to background CPU load.
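For illustration, here's a minimal sketch (not the test's actual code; the connection string is a placeholder) of how `effective_io_concurrency` can be raised just for the session that runs the validation query, so the scan over those ~300k pages is prefetched in batches rather than requested one page at a time:

```python
# Hypothetical sketch: raise the prefetch depth for the session running the
# validation query. Table "t" and the query match the traceback above; the
# connection string below is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=neondb")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("SET effective_io_concurrency = 32")  # session-level override
    cur.execute("select count(*), sum(data::bigint)::bigint from t")
    print(cur.fetchone())
```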

github-merge-queue bot pushed a commit that referenced this issue Dec 19, 2024
…fective_io_concurrency=2` in tests by default (#10114)

## Problem

`test_pgdata_import_smoke` writes two gigabytes of pages and then reads
them back serially. This is CPU bottlenecked and results in a long
runtime, and sensitivity to CPU load from other tests on the same
machine.

Closes: #10071

## Summary of changes

- Use `effective_io_concurrency=32` when doing sequential scans through 2GiB of pages in test_pgdata_import_smoke. This is a ~10x runtime decrease in the parts of the test that do sequential scans.
- Also set `effective_io_concurrency=2` for tests, as I noticed while debugging that we were doing all getpage requests serially, which is bad for checking the stability of the batching code.
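For concreteness, a rough sketch of the kind of test-side change the first bullet describes, assuming the fixture's `safe_psql_many` (visible in the traceback earlier in this issue) runs its queries on a single connection so the session-level `SET` applies to the scan; the expected values are placeholders and the actual diff in #10114 may differ:

```python
# Hypothetical sketch; the real change in #10114 may look different.
def validate_vanilla_equivalence(ep):
    # Run the SET and the scan on the same connection so the session-level
    # effective_io_concurrency override applies to the sequential scan.
    results = ep.safe_psql_many([
        "SET effective_io_concurrency = 32",
        "select count(*), sum(data::bigint)::bigint from t",
    ])
    assert results[-1] == [(10_000_000, 50_000_005_000_000)]  # placeholder values
```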