Add functions to get LFC state, prewarm LFC and monitor prewarm process #9587
base: main
Conversation
7117 tests run: 6813 passed, 0 failed, 304 skipped (full report). Code coverage (full report) collected from Rust tests only. The comment gets automatically updated with the latest test results: cca517f at 2024-12-23T10:36:36.236Z.
Force-pushed from a7648f3 to edb42d8.
First results of prewarming of large nodes. And the results are very... I don't know the proper word: not unexpected, because I expected something like this, but not to this extent; not upsetting, because it demonstrates the importance of prewarm; and not exciting, because something is definitely wrong (most likely with LFC). The whole database should fit in LFC.
It is launched at the compute node itself to minimize network latencies (it would be better to run it at some other node in the same cloud, but I do not yet know how to do that :)
After node restart performance is very bad and grows slowly:
What is worse - it is not growing monotonically - there are peaks and falls.
It took about 500 seconds to scan the whole table (with effective_io_concurrency=100), after which I once again ran pgbench, and now it almost reaches the previous level, but still not immediately:
Some comments on only the re-loading part, not (yet) the state extraction.
uint32 access_count : 30;
uint32 prewarm_requested : 1; /* entry should be filled by prewarm */
uint32 prewarm_started : 1; /* chunk is written by lfc_prewarm */
I think I'd rather have this combination of bitfields be a separate struct, or a single uint32 accessed through macros.
Additionally, `prewarm_requested`/`prewarm_started` are used like bools all around the new code, so those fields should really be `bool`-typed, not uint32.
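For illustration, a minimal sketch of the single-`uint32`-plus-macros variant; the `state` field and the macro names are hypothetical, not code from this PR:

```c
/* Low 30 bits: access count; top 2 bits: prewarm flags. */
#define LFC_ACCESS_COUNT_MASK		0x3FFFFFFFu
#define LFC_PREWARM_REQUESTED_BIT	((uint32) 1 << 30)
#define LFC_PREWARM_STARTED_BIT		((uint32) 1 << 31)

#define LFC_GET_ACCESS_COUNT(e)		((e)->state & LFC_ACCESS_COUNT_MASK)
#define LFC_PREWARM_REQUESTED(e)	(((e)->state & LFC_PREWARM_REQUESTED_BIT) != 0)
#define LFC_SET_PREWARM_STARTED(e)	((e)->state |= LFC_PREWARM_STARTED_BIT)
#define LFC_CLEAR_PREWARM_STARTED(e)	((e)->state &= ~LFC_PREWARM_STARTED_BIT)
```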
Sorry, I do not completely understand your arguments here.
Yes, it is possible to move these bits to a separate array, but why? 30 bits is more than enough for access_count - there will never be a billion backends, so those bits are just wasted. And the size of this structure is quite critical, because it is multiplied by the maximal possible LFC size. Certainly it is possible to make it more compact, for example by replacing pointers with indexes for the LRU list. But I still do not see any good reason to replace bit fields with bool.
Also, I completely do not understand your concern about using a 1-bit field as a bool. It is quite common practice in C, and in Postgres code as well. I do not see anything dangerous, inconvenient or confusing here.
@@ -149,7 +168,7 @@ lfc_disable(char const *op)
 {
 	int fd;

-	elog(WARNING, "Failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);
+	elog(WARNING, "LFC: failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);
Maybe we could use an `lfc_elog()` macro which adds the `"LFC: "` string for you?
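A minimal sketch of such a macro, assuming the GNU-style `##__VA_ARGS__` extension is acceptable here (`lfc_elog` is just the name suggested above):

```c
/* Prepend "LFC: " to every message; forward everything else to elog(). */
#define lfc_elog(elevel, fmt, ...) \
	elog(elevel, "LFC: " fmt, ##__VA_ARGS__)

/* usage: lfc_elog(WARNING, "failed to %s local file cache at %s: %m", op, lfc_path); */
```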
Makes sense - will do it.
typedef struct FileCacheStateEntry
{
	BufferTag key;
	uint32 bitmap[CHUNK_BITMAP_SIZE];
} FileCacheStateEntry;
IIUC, this is the entry stored to describe "please prefetch these blocks", right?
I'm not sure about this layout. Why shouldn't we just prefetch the whole chunk starting at BufferTag.key?
BufferTag is 20 bytes. `CHUNK_BITMAP_SIZE` with a 1Mb chunk needs 16 bytes, so it increases the size of FileCacheStateEntry by less than a factor of two. But here size seems not to be so important, while reducing the amount of downloaded pages does seem important, especially for OLTP kinds of workload. In principle, a smart loader could check the number of set bits in this bitmap and either download the whole chunk or individual pages (see the sketch below). But right now it is not implemented.
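For illustration only, a sketch of that not-yet-implemented heuristic, assuming `BLOCKS_PER_CHUNK` names the number of pages per chunk and using an arbitrary half-full cut-off:

```c
#include "port/pg_bitutils.h"

/* Sketch: decide whether to request the whole chunk or individual pages. */
static bool
prewarm_whole_chunk(const FileCacheStateEntry *fse)
{
	int			npages = 0;

	for (int i = 0; i < CHUNK_BITMAP_SIZE; i++)
		npages += pg_popcount32(fse->bitmap[i]);

	/* arbitrary policy: fetch the whole chunk when at least half of its pages are wanted */
	return npages >= BLOCKS_PER_CHUNK / 2;
}
```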
 * But as far as a write to LFC is performed without holding the lock, there is no guarantee that no such write is in progress.
 * This is why a second flag is used: `prewarm_started`. It is set by `lfc_prewarm` when it starts writing a page and cleared when the write is completed.
 * Any other backend writing to LFC should abandon its write to the LFC file (just not mark the page as loaded in the bitmap) if this flag is set.
 * So neither `lfc_prewarm` nor the backend saves the page in LFC in this case - it is just skipped.
Can't we do something like the following:

1. Prewarmer, with exclusively locked LFC:
   - Prewarmer: Get-or-insert chunk
   - Prewarmer: Mark chunk as LFC_PREWARM_STARTED
   - Prewarmer: copy prewarmer worker number (if relevant) into a new `prewarm_worker_field` on the chunk (this needs to be only a few bits, e.g. 8, as we will only have a few workers)
   - Prewarmer: unlink chunk from the LRU list, increasing refcount by 1
   - Prewarmer: copy refcount into the per-prewarmworker state's `wait_count`, reduced by 1 (self).
   - Release LFC lock.

   Now the chunk we're going to put data into is pinned (has been removed from the LRU, can't be reclaimed) and no new backends will start write-IO on the chunk due to the LFC_PREWARM_STARTED flag.

2. Normal backends: new writes that get to that chunk start waiting on a (new) condition variable `prewarm_progress` in the reported prewarm worker's worker chunk, until the LFC chunk doesn't have `LFC_PREWARM_STARTED` anymore.

3. Prewarmer:
   - wait on a new condition variable `chunkrelease` until the `wait_count` is empty (no more active concurrent writes)
   - Write out own data
   - Check LSNs for data
   - Set 'written' bit for valid data.
   - Unset LFC_PREWARM_STARTED
   - Signal condition variable `prewarm_progress`
   - Release and link chunk as normal.

Meanwhile, a normal backend will do:

1. Acquire a chunk, noting down whether LFC_PREWARM_STARTED is set for it in `lpstarted`.
2. If I'm writing and `lpstarted`, wait on the worker's `prewarm_progress` condition variable until the LFC_PREWARM_STARTED flag is released.
3. Do my thing.
4. While releasing my chunk with exclusive lock, if not `lpstarted`, but the chunk now has LFC_PREWARM_STARTED set:
   - Decrease the prewarm worker's `wait_count`
   - If `wait_count == 0`, after releasing the LFC lock, signal the chunk's worker's `chunkrelease` condition variable.
I think this is robust against concurrent writes to the same block, as we make sure to never have normal writers and the LFC prewarmer write to the same chunk at the same time.
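A sketch only of the backend-side wait on `prewarm_progress` from the scheme above, expressed with PostgreSQL's ConditionVariable API; the `PrewarmWorkerState` struct, the `flags` word and `LFC_PREWARM_STARTED` are assumed names rather than code from this PR, while `lfc_lock` is the existing LFC lock:

```c
#include "storage/condition_variable.h"
#include "storage/lwlock.h"
#include "utils/wait_event.h"

static void
lfc_wait_for_prewarm(FileCacheEntry *entry, PrewarmWorkerState *worker)
{
	for (;;)
	{
		bool		started;

		/* Re-check the flag under the LFC lock after every wake-up. */
		LWLockAcquire(lfc_lock, LW_SHARED);
		started = (entry->flags & LFC_PREWARM_STARTED) != 0;
		LWLockRelease(lfc_lock);

		if (!started)
			break;

		/*
		 * The first call enlists us on the condition variable and returns
		 * immediately, so a broadcast between the check and the sleep is
		 * not lost.
		 */
		ConditionVariableSleep(&worker->prewarm_progress, PG_WAIT_EXTENSION);
	}
	ConditionVariableCancelSleep();
}
```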
I thought about something like this (making the backend wait for completion of the block load by the prewarmer).
But I consider the probability of such a conflict to be relatively small (otherwise, if most of the blocks loaded by prewarm are in any case accessed by backends, then we do not need prewarm at all).
Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
Your `LFC_PREWARM_STARTED` flag serves the same goal as my `prewarm_started` flag. The main difference is that in your approach the backend will wait for prewarm completion while in my case it just skips this entry. Certainly the second approach is much simpler, while your approach can in some cases provide better performance. But I do not think this difference will be noticeable, under the assumption that the probability of such a collision is small enough.
> (otherwise, if most of the blocks loaded by prewarm are in any case accessed by backends, then we do not need prewarm at all).
I'm not sure I follow. We prewarm not just to get the data into LFC, but to have backends hit the LFC rather than having to access the pageserver.
> Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
That's what the LSN check is for.
> Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.
> Your `LFC_PREWARM_STARTED` flag serves the same goal as my `prewarm_started` flag. The main difference is that in your approach the backend will wait for prewarm completion while in my case it just skips this entry.
It seems to me that the main difference is that in your case the backend will not write out its page, while in my case it will (after waiting for the prefetch worker). Or do I misunderstand the description?
Note that the design I described only covers how we can merge prewarmed chunks' data into the LFC; it doesn't cover when and how we decide which chunks to prewarm. That'd probably be something along the lines of:
- Get next chunk from prewarm state
- If the chunk has no current entry in the LFC, and there are no empty chunks left, exit.
- If the chunk is already in LFC, remove the pages that are already in the LFC from the list of to-be-fetched pages for this chunk.
- Fetch the pages for this chunk, keeping notes on the LSN bounds used to fetch the pages
- Merge the pages into the chunk (see the algorithm details in the earlier comment on #9587) while keeping the noted-down LwLSN bounds in mind.
These steps can be executed in a superscalar/pipelined manner to maximize IO performance in the prefetching process(es).
> I'm not sure I follow. We prewarm not just to get the data into LFC, but to have backends hit the LFC rather than having to access the pageserver.
Sorry that I was unclear.
So, there can be three different situations:
- OLTP workload with a small number of clients
- OLTP workload with a large number of clients
- OLAP workload
In case 3) prewarm is not needed - OLAP queries do it much more efficiently themselves.
My experiments show that in case 2) prewarm is also not needed: a large number of parallel backends can do prewarm much faster than a separate prewarm worker.
So actually we need prewarm only in case 1 - where there are just a few backends which perform OLTP queries (index scans). If the database is large, it may take quite a long time for these few backends to access the whole working set, i.e. fill the LFC with data. All this time these backends, and so the client, will suffer from large delays (because of `get page` latency). Only in this case do I expect prewarm to be useful. But as far as there is a small number of backends and a large database, the chances that these backends will conflict with prewarm are very small.
This is why spending much effort on more efficient handling of these conflicts seems to be a waste of time.
> > Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
>
> That's what the LSN check is for.

Storing 8 bytes per page (not per 1Mb chunk!) seems to be too expensive: with a 1TB LFC we will have 128Mb just for storing LSNs. It is the same as what we now have for shared buffers.

> > Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
>
> Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.

It is not quite clear to me how it can work. So the prewarmer gets a page to be prewarmed, gets its LwLSN, marks it as "prewarm started", releases the lock and starts waiting for a response from PS. At the same moment some backend can write this page, then this page is thrown away and then once again retrieved by some backend. This backend will see the "prewarming" flag and wait until prewarm is completed. Ok, prewarm is completed. How to check now that the loaded version is what we need? We should remember the request LSN somewhere... but where? Or the prewarmer can try to check it itself by once again checking the LwLSN. But that is once again racy, because there is actually no synchronization between backends and the prewarmer.
I have discussed this problem with Heikki. He first suggested performing prewarming through shared buffers - in this case we have proper locking. But I do not want prewarm to affect shared buffers, especially taking into account that they are small. And once again - if the probability of conflict is very small, it is enough to have a quite conservative check here and there is no need to involve condition variables, LwLSN, ...
LwLSN checks are also not so trivial, taking into account that the cache is limited in size (entries can be evicted).
> So, there can be three different situations:
>
> - OLTP workload with a small number of clients
> - OLTP workload with a large number of clients
> - OLAP workload
>
> In case 3) prewarm is not needed - OLAP queries do it much more efficiently themselves.
I only partially agree here - indexes are slow to prewarm and I'd rather have them in memory by the time they're about to get hit.
> My experiments show that in case 2) prewarm is also not needed: a large number of parallel backends can do prewarm much faster than a separate prewarm worker.
I'm not sure about that. Your tests with prewarming have so far been using pg_prewarm, with whole relations at a time. That's not exactly representative of loading the previous compute's LFC chunks in order of most recent usage (inverse LRU), concurrent to the active workload.
Furthermore, so far testing has used pgbench in normal mode. While pgbench is one tool for benchmarking performance, it's not exactly comparable to join-heavy OLTP workloads, which are much more sensitive to IO in index lookup paths; and not very comparable to session-based workloads that often have some time between a commit and the next transaction starting. A rate-based (rather than throughput-based) pgbench run might show more favorable results.
> So actually we need prewarm only in case 1 - where there are just a few backends which perform OLTP queries (index scans). If the database is large, it may take quite a long time for these few backends to access the whole working set, i.e. fill the LFC with data. All this time these backends, and so the client, will suffer from large delays (because of `get page` latency). Only in this case do I expect prewarm to be useful. But as far as there is a small number of backends and a large database, the chances that these backends will conflict with prewarm are very small. This is why spending much effort on more efficient handling of these conflicts seems to be a waste of time.
With many clients it's quite likely the prewarmer was busy fetching pages that weren't yet in LFC, and which are useful to other sessions. Explicitly dropping those pages sounds like a waste of time and resources, given that any backend that might further request that page is now going to have to request it again with your design, and I think that's a larger waste of resources.
> > > Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
> >
> > That's what the LSN check is for.
>
> Storing 8 bytes per page (not per 1Mb chunk!) seems to be too expensive: with a 1TB LFC we will have 128Mb just for storing LSNs. It is the same as what we now have for shared buffers.
We don't store LSNs for pages in the LFC? We only have an LwLSN cache inside PostgreSQL, which we'll keep in mind when we fetch a chunk's pages. We don't have to store that data anywhere else; it's only used to synchronize backends so that we can't miss concurrent page updates for pages that get read; modified; flushed; evicted from LFC in the time it takes us to prefetch that page (which is extremely unlikely, but possible).
> > > Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
> >
> > Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.
>
> It is not quite clear to me how it can work. So the prewarmer gets a page to be prewarmed, gets its LwLSN, marks it as "prewarm started", releases the lock and starts waiting for a response from PS. At the same moment some backend can write this page, then this page is thrown away and then once again retrieved by some backend. This backend will see the "prewarming" flag and wait until prewarm is completed. Ok, prewarm is completed. How to check now that the loaded version is what we need? We should remember the request LSN somewhere... but where? Or the prewarmer can try to check it itself by once again checking the LwLSN. But that is once again racy, because there is actually no synchronization between backends and the prewarmer.
Not exactly. What we'd do is:

1. Select a chunk to prewarm.
   This requires the chunk to be present in the LFC, or an empty chunk to be available in the LFC that we can then fill.
   If there are no empty chunks left, and the chunk we selected isn't present in the LFC, we skip this chunk, as evicting data to prewarm other data may be detrimental to the workload.
2. Unlink the selected chunk from the LRU.
   Now it can't be removed through LRU eviction, so we know we can read and write data into it without many issues.
3. Note down which blocks are already written to the chunk.
   Now we don't have to read those blocks, as they're already the latest version.
4. Unlock the LFC.
5. Get LwLSNs for all blocks we're about to prefetch.
6. Fetch the blocks with those LwLSNs from PS into local memory.
7. Start merging the data into the LFC entry (adapted from the earlier algorithm).
8. With exclusively locked LFC:
   - Mark chunk as LFC_PREWARM_STARTED
   - Copy prewarmer worker number (if relevant) into a new `prewarm_worker_field` on the chunk (this needs to be only a few bits, e.g. 8, as we will only have a few workers)
   - Unlink chunk from the LRU list, increasing refcount by 1
   - Copy refcount into the per-prewarmworker state's `wait_count`, reduced by 1 (self).
   - Release LFC lock.

   Now the chunk we're going to put data into is pinned (has been removed from the LRU, can't be reclaimed) and no new backends will start write-IO on pages that aren't present in the chunk, due to the LFC_PREWARM_STARTED flag.

9. Normal backends: writes for new pages that get to that chunk start waiting on the (new) condition variable `prewarm_progress` in the reported prewarm worker's worker chunk, until the LFC chunk doesn't have `LFC_PREWARM_STARTED` anymore.

10. Prewarmer:
    - wait on the new condition variable `chunkrelease` until the `wait_count` is empty (no more active concurrent writes)
    - Write out own data into pages that aren't yet present in the chunk
    - ~~Check LSNs for data~~ - can be skipped, because no page could have been modified in the meantime, as page loads and modifications must have written a page to the LFC
    - Set 'written' bit for valid data.
    - Unset LFC_PREWARM_STARTED
    - Signal condition variable `prewarm_progress`
    - Release and link chunk as normal.

Here, only concurrent writers, and no other processes, wait on the prewarmer while it is writing data into the chunk - readers can always read any data that's already present (but may also decide to wait for the prewarmer to complete if the chunk doesn't yet contain the page they're looking for).
> I have discussed this problem with Heikki. He first suggested performing prewarming through shared buffers - in this case we have proper locking.
The locking would be as safe as can be, yes.
> But I do not want prewarm to affect shared buffers, especially taking into account that they are small.
Agreed.
> And once again - if the probability of conflict is very small, it is enough to have a quite conservative check here and there is no need to involve condition variables, LwLSN, ...
>
> LwLSN checks are also not so trivial, taking into account that the cache is limited in size (entries can be evicted).
How and why would we be allowed to skip doing the LWLSN checks in the general case?
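For concreteness, a sketch of the per-page check being discussed; `lfc_lookup_lwlsn()` is a hypothetical stand-in for the last-written-LSN (LwLSN) cache lookup, and the predicate only illustrates the intended invariant:

```c
#include "access/xlogdefs.h"
#include "storage/buf_internals.h"

extern XLogRecPtr lfc_lookup_lwlsn(const BufferTag *tag);	/* hypothetical */

static bool
prewarm_page_still_valid(const BufferTag *tag, XLogRecPtr fetch_lsn)
{
	XLogRecPtr	lwlsn = lfc_lookup_lwlsn(tag);

	/*
	 * If the LwLSN advanced past the LSN used for the pageserver request,
	 * the page was modified (and possibly evicted again) while we were
	 * prefetching, so the fetched copy is stale and must not be installed
	 * in the LFC.
	 */
	return lwlsn <= fetch_lsn;
}
```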
Certainly it depends on the workload. But I made this conclusion based on the results of my experiments with prewarming using `pg_prewarm`.
I think that's wrong;
A page can be modified without being read: a backend may just rewrite the page with new content.
Sorry, I do not understand what is wrong. But yes, I agree that LFC prewarming could be smarter by downloading the most useful pages first. But in any case - with any heuristics and stored access counters - prewarm can only download pages which may be required, while a backend executing some query downloads pages which are actually required. So "natural prewarming" always has better efficiency than any "manual prewarming".
{
	hash = get_hash_value(lfc_hash, &fs[i].key);
	entry = hash_search_with_hash_value(lfc_hash, &fs[i].key, hash, HASH_ENTER, &found);
	/* Do not prewarm chunks which are already present in LFC */
	if (!found)
	{
		entry->offset = lfc_ctl->size++;
		entry->hash = hash;
		entry->access_count = 0;
		entry->prewarm_requested = true;
		entry->prewarm_started = false;
		memset(entry->bitmap, 0, sizeof entry->bitmap);
		/* Most recently visited pages are stored first */
		dlist_push_head(&lfc_ctl->lru, &entry->list_node);
		lfc_ctl->used += 1;
	}
}
This should still prewarm the pages that aren't present in LFC, while right now it would ignore prewarm([1,2,3]) if LFC state is loaded([4]).
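A rough sketch of the behaviour asked for here, reusing the entry and field names from the snippet above: when the chunk is already present, still request prewarm for the pages the chunk does not yet contain instead of skipping it. A real implementation would additionally have to remember which pages are missing (e.g. in an extra per-entry bitmap), so treat this helper as illustrative only:

```c
/*
 * Sketch only: mark an already-present chunk for prewarm if the saved state
 * lists pages that this chunk does not contain yet.
 */
static void
lfc_request_missing_pages(FileCacheEntry *entry, FileCacheStateEntry *fse)
{
	for (int i = 0; i < CHUNK_BITMAP_SIZE; i++)
	{
		/* pages listed in the saved state but not yet cached in this chunk */
		uint32		missing = fse->bitmap[i] & ~entry->bitmap[i];

		if (missing != 0)
		{
			entry->prewarm_requested = true;
			break;
		}
	}
}
```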
Sorry, maybe I do not understand you correctly. Do you mean that we will not prewarm the marked pages from a chunk if some other page from that chunk is already loaded?
Yes, that is true. And it is a consequence of chunk granularity. There are two reasons for using chunks larger than the page size:
- Reduce the size of the LFC hash in shared memory
- Improve access locality (and so seqscan speed) by storing subsequent pages together on disk
Because of 1) we have to detect conflicts between backends and prewarm at the chunk level.
But if the probability of a conflict is small (as assumed), it should not be a big problem.
pgxn/neon/file_cache.c
};
shard_no = get_shard_number(&fs[chunk_no].key);
while (!page_server->send(shard_no, (NeonRequest *) &request)
	   || !page_server->flush(shard_no))
I think the flush should only happen once we start waiting?
And this behaviour that circumvents the prefetch system is actually unsafe, as this will cause a hung backend in the following case:
- Start requesting pages [A, B, ...]:
  - Request page A
    Connection state: [GetPage A]
  - PS connection breaks
    Connection state: []
  - Request page B
    Connection state: [GetPage B]
- Start processing pages:
  - Wait for page A -> receives page B
    Connection state: []
  - Wait for page B -> hangs
  - ...
Thank you very much for pointing out the problem. I missed it.
I think that the problem is not in flush. We in any case have some sequence of prewarm page requests which were sent to PS and which are lost in case of a PS reconnect. We should perform cleanup as it is done now for prefetch.
> I think that the problem is not in flush

Correct, `flush` wasn't an issue by itself, but I did think it was slightly wasteful to `flush` on every `send` rather than only after all `send`s were written.
elog(LOG, "LFC: start loading %ld chunks", (long)n_entries);

while (true)
This doesn't lock the relation; and that might cause PS errors when the relation gets dropped concurrently.
Yes, we should handle (ignore) possible PS errors.
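One possible shape for "handle (ignore) possible PS errors", sketched with PostgreSQL's `PG_TRY`/`PG_CATCH` machinery; `prewarm_one_chunk()` is a hypothetical helper, and error-context/memory-context handling is elided:

```c
/*
 * Sketch only: ignore pageserver errors for a single chunk so that a
 * concurrently dropped relation does not abort the whole prewarm pass.
 * A real implementation would also switch back to the caller's memory
 * context in the catch block.
 */
PG_TRY();
{
	prewarm_one_chunk(&fs[chunk_no]);
}
PG_CATCH();
{
	/* Swallow the error (e.g. relation was dropped) and continue. */
	FlushErrorState();
}
PG_END_TRY();
```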
pgxn/neon/file_cache.c
if (entry->prewarm_started)
{
	/*
	 * Some page of this chunk is currently being written by `lfc_prewarm`.
	 * We should give up, so as not to interfere with it.
	 * But clearing the `prewarm_requested` flag also will not allow `lfc_prewarm` to fix its result.
	 */
	entry->prewarm_requested = false;
	LWLockRelease(lfc_lock);
	return;
}
I think this is unsound (or at least needs a clear warning here and in prefetch):
1. Have chunk C that contains written page P
2. Have prewarm start prefetching chunk C page Q
3. Have a backend write to chunk C's page P
4. Write is ignored because prefetch state flag was set.
5. Next read reads stale LFC entry.

If "1. Have chunk C that contains written page P" is impossible (due to only prefetching nonexistent/empty chunks) then we need a clear warning about this, and we'd have the bad-for-performance race condition:
1. Prefetch starts for chunk C
2. Backend writes modified page P for chunk C
3. Backend ignores LFC write because of prefetch state set
4. Backend reads page P for chunk C, has to go to PS instead of reading the recently-written modified page.
> Next read reads stale LFC entry.

Stale reads are certainly not acceptable.
Thank you for pointing it out. I need to clear the bits for all affected pages in this case.
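A minimal sketch of that fix, assuming `BLOCKS_PER_CHUNK` is a power of two and `blkno` is the block whose LFC write is being abandoned (a multi-block write would clear one bit per affected page):

```c
if (entry->prewarm_started)
{
	/* page index within the chunk; assumes BLOCKS_PER_CHUNK is a power of two */
	int			chunk_offs = blkno & (BLOCKS_PER_CHUNK - 1);

	/* clear the "present" bit so a later read cannot treat a stale copy as valid */
	entry->bitmap[chunk_offs >> 5] &= ~((uint32) 1 << (chunk_offs & 31));
	entry->prewarm_requested = false;
	LWLockRelease(lfc_lock);
	return;
}
```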
Force-pushed from a826a99 to 07027bd.
Problem
Lightweight version of PR #9197
Summary of changes
Now there are just two functions, for getting the LFC state and for prewarming.
The AUX mechanism is not used and no BGW is started.
Checklist before requesting a review
Checklist before merging