
Disable readstream's reliance on seqscan readahead #9860

Merged: 1 commit merged into main from fix/pg17-read-stream-readahead
Dec 11, 2024

Conversation

MMeent
Contributor

@MMeent MMeent commented Nov 22, 2024

Neon doesn't have seqscan detection of its own, so stop read_stream from trying to utilize that readahead, and instead make it issue readahead of its own.

Problem

@knizhnik noticed that we didn't issue smgrprefetch[v] calls for seqscans in PG17 due to the move to the read_stream API, which assumes that the underlying IO facilities do seqscan detection for readahead. That is a wrong assumption when Neon is involved, so let's remove the code that applies that assumption.

Summary of changes

Remove the cases where seqscans are detected and prefetch is disabled as a consequence, and instead don't do that detection.

PG PR: neondatabase/postgres#532

@MMeent MMeent requested a review from knizhnik November 22, 2024 19:00
@MMeent MMeent requested a review from a team as a code owner November 22, 2024 19:00

github-actions bot commented Nov 22, 2024

7073 tests run: 6758 passed, 0 failed, 315 skipped (full report)


Flaky tests (4)

Postgres 17

Postgres 15

  • test_physical_replication_config_mismatch_too_many_known_xids: release-arm64
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-timeline-delete-after-rm]: release-arm64

Postgres 14

Code coverage* (full report)

  • functions: 31.4% (8333 of 26537 functions)
  • lines: 47.7% (65625 of 137574 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
4995f7f at 2024-12-10T23:04:03.337Z :recycle:

@knizhnik
Contributor

Unfortunately, the problem with prefetch in PG17 still cannot be considered resolved.
Yes, with #9860 prefetch is performed and explain (prefetch) shows a 100% prefetch hit rate.
But as far as I understand, prefetch is now performed in StartReadBuffersImpl:

	if (flags & READ_BUFFERS_ISSUE_ADVICE)
	{
		/*
		 * In theory we should only do this if PinBufferForBlock() had to
		 * allocate new buffers above.  That way, if two calls to
		 * StartReadBuffers() were made for the same blocks before
		 * WaitReadBuffers(), only the first would issue the advice. That'd be
		 * a better simulation of true asynchronous I/O, which would only
		 * start the I/O once, but isn't done here for simplicity.  Note also
		 * that the following call might actually issue two advice calls if we
		 * cross a segment boundary; in a true asynchronous version we might
		 * choose to process only one real I/O at a time in that case.
		 */
		smgrprefetch(operation->smgr,
					 operation->forknum,
					 blockNum,
					 operation->io_buffers_len);
	}

But this means that the prefetch distance is always 1 and effective_io_concurrency is not taken into account.
So it is no wonder that the max number of in-flight requests in pg17 is always 1.
And this is why pg17 is 5 (five!) times slower on some queries.
I did some local microbenchmarking, simulating 0.2msec network latency:

create table t(pk integer, filler text default repeat('?', 200));
insert into t (pk) values (generate_series(1,100000));
set effective_io_concurrency=100;
set max_parallel_workers_per_gather=0; 
explain (analyze,prefetch,buffers) select sum(pk) from t;

Query execution time in pg16: 654 msec
Query execution time in pg17: 3570 msec

@hlinnaka
Contributor

Unfortunately, the problem with prefetch in PG17 still cannot be considered resolved. Yes, with #9860 prefetch is performed and explain (prefetch) shows a 100% prefetch hit rate. But as far as I understand, prefetch is now performed in StartReadBuffersImpl:

	if (flags & READ_BUFFERS_ISSUE_ADVICE)
	{
		/*
		 * In theory we should only do this if PinBufferForBlock() had to
		 * allocate new buffers above.  That way, if two calls to
		 * StartReadBuffers() were made for the same blocks before
		 * WaitReadBuffers(), only the first would issue the advice. That'd be
		 * a better simulation of true asynchronous I/O, which would only
		 * start the I/O once, but isn't done here for simplicity.  Note also
		 * that the following call might actually issue two advice calls if we
		 * cross a segment boundary; in a true asynchronous version we might
		 * choose to process only one real I/O at a time in that case.
		 */
		smgrprefetch(operation->smgr,
					 operation->forknum,
					 blockNum,
					 operation->io_buffers_len);
	}

But this means that the prefetch distance is always 1 and effective_io_concurrency is not taken into account.

effective_io_concurrency is taken into account when the read stream is initialized (read_stream_begin_relation). The read stream facility makes many StartReadBuffers() calls before waiting for the reads to complete with WaitReadBuffers(). So there are always many I/Os in flight, started with StartReadBuffers() but not yet completed by WaitReadBuffers(). The number of such in-flight I/Os is controlled by max_ios, which is derived from effective_io_concurrency.

So it is no wonder that the max number of in-flight requests in pg17 is always 1. And this is why pg17 is 5 (five!) times slower on some queries. I did some local microbenchmarking, simulating 0.2 msec network latency:

Ok, this clearly needs investigation then...

@knizhnik
Contributor

Ok, this clearly needs investigation then...

Ok, I have investigated it.
So max_ios = effective_io_concurrency = 100.
For seqscans we use a ring buffer strategy with just 16 dedicated buffers. So

strategy_pin_limit = GetAccessStrategyPinLimit(strategy);

restricts the number of pinned buffers to 16.
But then LimitAdditionalPins(&max_pinned_buffers) is called, which divides NBuffers by MaxBackends and gets 1 for the 1MB shared_buffers used for testing.

If I increase the shared buffers size to 128MB, then the prefetch distance is set to 16 and we obviously get a much better query execution time: 627 msec vs. 3982 msec with 1MB shared_buffers. But it is still larger than the 567 msec for pg16.

I am still not sure that the approach of pinning prefetched buffers is good for us:

  1. 100 connections * 100 pinned buffers (effective_io_concurrency=100) occupy 80MB of shared buffers. That is smaller than the 128MB currently used in prod, but it still significantly increases the risk of running out of free shared buffers.
  2. The Postgres eviction algorithm (CLOCK) becomes very inefficient with a large number of pinned buffers.
  3. The size of the seqscan ring is expected to be quite small (16 by default), so we cannot expect a large prefetch distance.
  4. If we implement get_page request collapsing in the pageserver, then the larger the prefetch distance, the better the performance will be.
  5. In the case of a large prefetch distance (e.g. 100, if we somehow manage to support it), pinning prefetched buffers means that another query which tries to access such a page will have to wait until all previous prefetch requests are completed (100 in the worst case). This can cause a significant increase in simple query execution time. Consider an OLTP request running concurrently with some OLAP requests.

@knizhnik
Contributor

knizhnik commented Nov 28, 2024

I performed some benchmarking on staging comparing the efficiency of prefetch in PG16 and PG17.
All runs were done on staging with a fixed-size 120 CU compute and a 30GB (scale 2000) pgbench database. It has a large LFC, but that is not important for these tests, because I restart the compute before each query.

So I performed three experiments:

  1. pg_prewarm('pgbench_accounts')
  2. Parallel seqscan:
set max_parallel_workers=10;
set max_parallel_workers_per_gather=10;
select sum(abalance) from pgbench_accounts;
  3. Synchronized seqscan:
    I ran 10 concurrent queries: set max_parallel_workers_per_gather=0; select sum(abalance) from pgbench_accounts;

Results (seconds):

pg16:
pg_prewarm: 533 seconds
seqscan: 523 seconds
parallel seqscan: 123 seconds
synchronized seqscans: 1717 seconds


pg17 (#9860):
pg_prewarm: 2117 seconds
seqscan: 2210 seconds
parallel seqscan: 257 seconds
synchronized seqscans: > 4 hours

pg17 (#9865):
pg_prewarm: 2396 seconds
seqscan: 451 seconds
parallel seqscan: 214 seconds
synchronized seqscans: 2100 seconds 

pg17 (#9860 + io_combine_limit=1):
pg_prewarm: 672 seconds
seqscan: 457 seconds
parallel seqscan: 156 seconds
synchronized seqscans: 1804 seconds

Contributor

@hlinnaka hlinnaka left a comment


I believe this is ready to be committed. Konstantin's benchmark suggests that we might want to also set io_combine_limit=1. We can do that as a separate PR, or include it here, but what's in this PR seems correct and a step forward in any case.

Neon doesn't have seqscan detection of its own, so stop read_stream
from trying to utilize that readahead, and instead make it issue
readahead of its own.
@MMeent MMeent force-pushed the fix/pg17-read-stream-readahead branch from ee1d7a1 to 4995f7f Compare December 10, 2024 21:44
@MMeent MMeent enabled auto-merge December 10, 2024 21:53
@MMeent MMeent added this pull request to the merge queue Dec 11, 2024
Merged via the queue into main with commit 597125e Dec 11, 2024
82 checks passed
@MMeent MMeent deleted the fix/pg17-read-stream-readahead branch December 11, 2024 00:51