
Disable readstream's reliance on seqscan readahead #9860

Merged: 1 commit merged into main from fix/pg17-read-stream-readahead
Dec 11, 2024

Conversation

MMeent
Contributor

@MMeent MMeent commented Nov 22, 2024

Neon doesn't have seqscan detection of its own, so stop read_stream from trying to utilize that readahead, and instead make it issue readahead of its own.

Problem

@knizhnik noticed that we didn't issue smgrprefetch[v] calls for seqscans in PG17 due to the move to the read_stream API, which assumes that the underlying IO facilities do seqscan detection for readahead. That is a wrong assumption when Neon is involved, so let's remove the code that applies that assumption.

Summary of changes

Remove the cases where seqscans are detected and prefetch is disabled as a consequence, and instead don't do that detection.

PG PR: neondatabase/postgres#532

@MMeent MMeent requested a review from knizhnik November 22, 2024 19:00
@MMeent MMeent requested a review from a team as a code owner November 22, 2024 19:00

github-actions bot commented Nov 22, 2024

7073 tests run: 6758 passed, 0 failed, 315 skipped (full report)


Flaky tests (4)

Postgres 17

Postgres 15

  • test_physical_replication_config_mismatch_too_many_known_xids: release-arm64
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-timeline-delete-after-rm]: release-arm64

Postgres 14

Code coverage* (full report)

  • functions: 31.4% (8333 of 26537 functions)
  • lines: 47.7% (65625 of 137574 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
4995f7f at 2024-12-10T23:04:03.337Z :recycle:

@knizhnik
Contributor

Unfortunately, the problem with prefetch in PG17 still cannot be considered resolved.
Yes, with #9860 prefetch is performed and explain (prefetch) shows a 100% prefetch hit rate.
But as far as I understand, prefetch is now performed in StartReadBuffersImpl:

	if (flags & READ_BUFFERS_ISSUE_ADVICE)
	{
		/*
		 * In theory we should only do this if PinBufferForBlock() had to
		 * allocate new buffers above.  That way, if two calls to
		 * StartReadBuffers() were made for the same blocks before
		 * WaitReadBuffers(), only the first would issue the advice. That'd be
		 * a better simulation of true asynchronous I/O, which would only
		 * start the I/O once, but isn't done here for simplicity.  Note also
		 * that the following call might actually issue two advice calls if we
		 * cross a segment boundary; in a true asynchronous version we might
		 * choose to process only one real I/O at a time in that case.
		 */
		smgrprefetch(operation->smgr,
					 operation->forknum,
					 blockNum,
					 operation->io_buffers_len);
	}

But this means that the prefetch distance is always 1 and effective_io_concurrency is not taken into account.
So it is no wonder that the max number of in-flight requests in pg17 is always 1.
And this is why pg17 is 5 (five!) times slower on some queries.
I did some local microbenchmarking, simulating 0.2msec network latency:

create table t(pk integer, filler text default repeat('?', 200));
insert into t (pk) values (generate_series(1,100000));
set effective_io_concurrency=100;
set max_parallel_workers_per_gather=0; 
explain (analyze,prefetch,buffers) select sum(pk) from t;

Query execution time in pg16: 654 msec
Query execution time in pg17: 3570 msec

@hlinnaka
Contributor

Unfortunately, the problem with prefetch in PG17 still cannot be considered resolved. Yes, with #9860 prefetch is performed and explain (prefetch) shows a 100% prefetch hit rate. But as far as I understand, prefetch is now performed in StartReadBuffersImpl:

	if (flags & READ_BUFFERS_ISSUE_ADVICE)
	{
		/*
		 * In theory we should only do this if PinBufferForBlock() had to
		 * allocate new buffers above.  That way, if two calls to
		 * StartReadBuffers() were made for the same blocks before
		 * WaitReadBuffers(), only the first would issue the advice. That'd be
		 * a better simulation of true asynchronous I/O, which would only
		 * start the I/O once, but isn't done here for simplicity.  Note also
		 * that the following call might actually issue two advice calls if we
		 * cross a segment boundary; in a true asynchronous version we might
		 * choose to process only one real I/O at a time in that case.
		 */
		smgrprefetch(operation->smgr,
					 operation->forknum,
					 blockNum,
					 operation->io_buffers_len);
	}

But this means that the prefetch distance is always 1 and effective_io_concurrency is not taken into account.

effective_io_concurrency is taken into account when the read stream is initialized (read_stream_begin_relation). The read stream facility makes many StartReadBuffers() calls before waiting for the reads to complete with WaitReadBuffers(). So there are always many I/Os in flight, started with StartReadBuffers() but not yet completed by WaitReadBuffers(). The number of such in-flight I/Os is controlled by max_ios, which is derived from effective_io_concurrency.

So it is no wonder that the max number of in-flight requests in pg17 is always 1. And this is why pg17 is 5 (five!) times slower on some queries. I did some local microbenchmarking, simulating 0.2 msec network latency:

Ok, this clearly needs investigation then...

@knizhnik
Contributor

Ok, this clearly needs investigation then...

Ok, I have investigated it.
So max_ios = effective_io_concurrency = 100.
For seqscans we use a ring buffer strategy with just 16 dedicated buffers. So

strategy_pin_limit = GetAccessStrategyPinLimit(strategy);

restricts the number of pinned buffers to 16.
But then LimitAdditionalPins(&max_pinned_buffers) is called, which divides NBuffers by MaxBackends and gets 1 for the 1MB shared_buffers used for testing.

If I increase the shared buffers size to 128MB, then the prefetch distance is set to 16 and we obviously get a much better query execution time: 627 msec vs. 3982 msec with 1MB shared_buffers. But it is still larger than the 567 msec for pg16.

I am still not sure that the approach of pinning prefetched buffers is good for us:

  1. 100 connections * 100 pinned buffers (effective_io_concurrency=100) occupy 80MB of shared buffers. That is smaller than the 128MB currently used in prod, but it still significantly increases the risk of running out of free shared buffers.
  2. The Postgres eviction algorithm (CLOCK) becomes very inefficient with a large number of pinned buffers.
  3. The size of the seqscan ring is expected to be quite small (16 by default), so we cannot expect a large prefetch distance.
  4. If we implement get_page request collapsing in the pageserver, then the larger the prefetch distance, the better the performance will be.
  5. In the case of a large prefetch distance (e.g. 100, if we somehow manage to support it), pinning prefetched buffers means that another query which tries to access such a page will have to wait until all previous prefetch requests are completed (100 in the worst case). This can cause a significant increase in simple query execution time. Consider an OLTP request running concurrently with some OLAP requests.

@knizhnik
Contributor

knizhnik commented Nov 28, 2024

I performed some benchmarking on staging comparing the efficiency of prefetch in PG16 and PG17.
All runs were done on staging with a fixed-size 120 CU compute and a 30GB (scale 2000) pgbench database. It has a large LFC, but that is not important for these tests, because I restart the compute before each query.

So I performed three experiments:

  1. pg_prewarm('pgbench_accounts')
  2. Parallel seqscan:
set max_parallel_workers=10;
set max_parallel_workers_per_gather=10;
select sum(abalance) from pgbench_accounts;
  3. Synchronized seqscan:
    I ran 10 concurrent queries: set max_parallel_workers_per_gather=0; select sum(abalance) from pgbench_accounts;

Results (seconds):

pg16:
pg_prewarm: 533 seconds
seqscan: 523 seconds
parallel seqscan: 123 seconds
synchronized seqscans: 1717 seconds


pg17 (#9860):
pg_prewarm: 2117 seconds
seqscan: 2210 seconds
parallel seqscan: 257 seconds
synchronized seqscans: > 4 hours

pg17 (#9865):
pg_prewarm: 2396 seconds
seqscan: 451 seconds
parallel seqscan: 214 seconds
synchronized seqscans: 2100 seconds 

pg17 (#9860 + io_combine_limit=1):
pg_prewarm: 672 seconds
seqscan: 457 seconds
parallel seqscan: 156 seconds
synchronized seqscans: 1804 seconds

Contributor

@hlinnaka hlinnaka left a comment


I believe this is ready to be committed. Konstantin's benchmark suggests that we might want to also set io_combine_limit=1. We can do that as a separate PR, or include it here, but what's in this PR seems correct and a step forward in any case.

Neon doesn't have seqscan detection of its own, so stop read_stream
from trying to utilize that readahead, and instead make it issue
readahead of its own.
@MMeent MMeent force-pushed the fix/pg17-read-stream-readahead branch from ee1d7a1 to 4995f7f Compare December 10, 2024 21:44
@MMeent MMeent enabled auto-merge December 10, 2024 21:53
@MMeent MMeent added this pull request to the merge queue Dec 11, 2024
Merged via the queue into main with commit 597125e Dec 11, 2024
82 checks passed
@MMeent MMeent deleted the fix/pg17-read-stream-readahead branch December 11, 2024 00:51