Add functions to get LFC state, prewarm LFC and monitor prewarm process #9587
base: main
Conversation
7117 tests run: 6813 passed, 0 failed, 304 skipped (full report). Code coverage (full report) collected from Rust tests only. The comment gets automatically updated with the latest test results: cca517f at 2024-12-23T10:36:36.236Z.
Force-pushed from a7648f3 to edb42d8.
First results of prewarming of large nodes. And the results are very... I don't know the proper word: not unexpected, because I expected something like this, but not to this extent; not upsetting, because it demonstrates the importance of prewarm; and not exciting, because something is definitely wrong (most likely with LFC). The whole database should fit in LFC.
It is launched at the compute node itself to minimize network latencies (it would be better to run it at some other node in the same cloud, but I do not yet know how to do that :)
After node restart performance is very bad and grows slowly:
What is worse - it is not growing monotonically - there are peaks and falls.
It took about 500 seconds to scan the whole table (with effective_io_concurrency=100), after which I once again ran pgbench, and now it almost reaches the previous level, but still not immediately:
Some comments on only the re-loading part, not (yet) the state extraction.
uint32 access_count : 30;
uint32 prewarm_requested : 1; /* entry should be filled by prewarm */
uint32 prewarm_started : 1; /* chunk is written by lfc_prewarm */
I think I'd rather have this combination of bitfields be a separate struct, or a single uint32 accessed through macros.
Additionally, `prewarm_requested`/`prewarm_started` are used like bools all around the new code, so those fields should really be `bool`-typed, not uint32.
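For illustration, a minimal sketch of the single-`uint32`-plus-macros variant; the `state` field and the macro names are hypothetical, not code from this PR:

```c
/* Low 30 bits: access count; top 2 bits: prewarm flags. */
#define LFC_ACCESS_COUNT_MASK		0x3FFFFFFFu
#define LFC_PREWARM_REQUESTED_BIT	((uint32) 1 << 30)
#define LFC_PREWARM_STARTED_BIT		((uint32) 1 << 31)

#define LFC_GET_ACCESS_COUNT(e)		((e)->state & LFC_ACCESS_COUNT_MASK)
#define LFC_PREWARM_REQUESTED(e)	(((e)->state & LFC_PREWARM_REQUESTED_BIT) != 0)
#define LFC_SET_PREWARM_STARTED(e)	((e)->state |= LFC_PREWARM_STARTED_BIT)
#define LFC_CLEAR_PREWARM_STARTED(e)	((e)->state &= ~LFC_PREWARM_STARTED_BIT)
```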
Sorry, I do not completely understand your arguments here.
Yes, it is possible to move these bits to a separate array, but why? 30 bits is more than enough for access_count - there will never be a billion backends, so those bits are just wasted. And the size of this structure is quite critical, because it is multiplied by the maximal possible LFC size. Certainly it is possible to make it more compact, for example by replacing pointers with indexes for the LRU list. But I still do not see any good reason to replace bit fields with bool.
Also, I completely do not understand your concern about using a 1-bit field as a bool. It is quite common practice in C, and in Postgres code as well. I do not see anything dangerous, inconvenient or confusing here.
@@ -149,7 +168,7 @@ lfc_disable(char const *op)
 {
 	int fd;

-	elog(WARNING, "Failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);
+	elog(WARNING, "LFC: failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);
Maybe we could use an `lfc_elog()` macro which adds the `"LFC: "` string for you?
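A minimal sketch of such a macro, assuming the GNU-style `##__VA_ARGS__` extension is acceptable here (`lfc_elog` is just the name suggested above):

```c
/* Prepend "LFC: " to every message; forward everything else to elog(). */
#define lfc_elog(elevel, fmt, ...) \
	elog(elevel, "LFC: " fmt, ##__VA_ARGS__)

/* usage: lfc_elog(WARNING, "failed to %s local file cache at %s: %m", op, lfc_path); */
```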
Makes sense - will do it.
typedef struct FileCacheStateEntry
{
	BufferTag key;
	uint32 bitmap[CHUNK_BITMAP_SIZE];
} FileCacheStateEntry;
IIUC, this is the entry stored to describe "please prefetch these blocks", right?
I'm not sure about this layout. Why shouldn't we just prefetch the whole chunk starting at BufferTag.key?
BufferTag is 20 bytes. `CHUNK_BITMAP_SIZE` with a 1Mb chunk needs 16 bytes, so it increases the size of FileCacheStateEntry by less than a factor of two. But here size seems not to be so important, while reducing the amount of downloaded pages does seem important, especially for OLTP kinds of workload. In principle, a smart loader could check the number of set bits in this bitmap and either download the whole chunk or individual pages (see the sketch below). But right now it is not implemented.
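For illustration only, a sketch of that not-yet-implemented heuristic, assuming `BLOCKS_PER_CHUNK` names the number of pages per chunk and using an arbitrary half-full cut-off:

```c
#include "port/pg_bitutils.h"

/* Sketch: decide whether to request the whole chunk or individual pages. */
static bool
prewarm_whole_chunk(const FileCacheStateEntry *fse)
{
	int			npages = 0;

	for (int i = 0; i < CHUNK_BITMAP_SIZE; i++)
		npages += pg_popcount32(fse->bitmap[i]);

	/* arbitrary policy: fetch the whole chunk when at least half of its pages are wanted */
	return npages >= BLOCKS_PER_CHUNK / 2;
}
```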
 * But as far as a write to LFC is performed without holding the lock, there is no guarantee that no such write is in progress.
 * This is why a second flag is used: `prewarm_started`. It is set by `lfc_prewarm` when it starts writing a page and cleared when the write is completed.
 * Any other backend writing to LFC should abandon its write to the LFC file (just not mark the page as loaded in the bitmap) if this flag is set.
 * So neither `lfc_prewarm` nor the backend saves the page in LFC in this case - it is just skipped.
Can't we do something like the following:

1. Prewarmer, with exclusively locked LFC:
   - Prewarmer: Get-or-insert chunk
   - Prewarmer: Mark chunk as LFC_PREWARM_STARTED
   - Prewarmer: copy prewarmer worker number (if relevant) into a new `prewarm_worker_field` on the chunk (this needs to be only a few bits, e.g. 8, as we will only have a few workers)
   - Prewarmer: unlink chunk from the LRU list, increasing refcount by 1
   - Prewarmer: copy refcount into the per-prewarmworker state's `wait_count`, reduced by 1 (self).
   - Release LFC lock.

   Now the chunk we're going to put data into is pinned (has been removed from the LRU, can't be reclaimed) and no new backends will start write-IO on the chunk due to the LFC_PREWARM_STARTED flag.

2. Normal backends: new writes that get to that chunk start waiting on a (new) condition variable `prewarm_progress` in the reported prewarm worker's worker chunk, until the LFC chunk doesn't have `LFC_PREWARM_STARTED` anymore.

3. Prewarmer:
   - wait on a new condition variable `chunkrelease` until the `wait_count` is empty (no more active concurrent writes)
   - Write out own data
   - Check LSNs for data
   - Set 'written' bit for valid data.
   - Unset LFC_PREWARM_STARTED
   - Signal condition variable `prewarm_progress`
   - Release and link chunk as normal.

Meanwhile, a normal backend will do:

1. Acquire a chunk, noting down whether LFC_PREWARM_STARTED is set for it in `lpstarted`.
2. If I'm writing and `lpstarted`, wait on the worker's `prewarm_progress` condition variable until the LFC_PREWARM_STARTED flag is released.
3. Do my thing.
4. While releasing my chunk with exclusive lock, if not `lpstarted`, but the chunk now has LFC_PREWARM_STARTED set:
   - Decrease the prewarm worker's `wait_count`
   - If `wait_count == 0`, after releasing the LFC lock, signal the chunk's worker's `chunkrelease` condition variable.
I think this is robust against concurrent writes to the same block, as we make sure to never have normal writers and the LFC prewarmer write to the same chunk at the same time.
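A sketch only of the backend-side wait on `prewarm_progress` from the scheme above, expressed with PostgreSQL's ConditionVariable API; the `PrewarmWorkerState` struct, the `flags` word and `LFC_PREWARM_STARTED` are assumed names rather than code from this PR, while `lfc_lock` is the existing LFC lock:

```c
#include "storage/condition_variable.h"
#include "storage/lwlock.h"
#include "utils/wait_event.h"

static void
lfc_wait_for_prewarm(FileCacheEntry *entry, PrewarmWorkerState *worker)
{
	for (;;)
	{
		bool		started;

		/* Re-check the flag under the LFC lock after every wake-up. */
		LWLockAcquire(lfc_lock, LW_SHARED);
		started = (entry->flags & LFC_PREWARM_STARTED) != 0;
		LWLockRelease(lfc_lock);

		if (!started)
			break;

		/*
		 * The first call enlists us on the condition variable and returns
		 * immediately, so a broadcast between the check and the sleep is
		 * not lost.
		 */
		ConditionVariableSleep(&worker->prewarm_progress, PG_WAIT_EXTENSION);
	}
	ConditionVariableCancelSleep();
}
```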
I thought about something like this (making the backend wait for completion of the block load by the prewarmer).
But I consider the probability of such a conflict to be relatively small (otherwise, if most of the blocks loaded by prewarm are in any case accessed by backends, then we do not need prewarm at all).
Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
Your `LFC_PREWARM_STARTED` flag serves the same goal as my `prewarm_started` flag. The main difference is that in your approach the backend will wait for prewarm completion while in my case it just skips this entry. Certainly the second approach is much simpler, while your approach can in some cases provide better performance. But I do not think this difference will be noticeable, under the assumption that the probability of such a collision is small enough.
> (otherwise, if most of the blocks loaded by prewarm are in any case accessed by backends, then we do not need prewarm at all).
I'm not sure I follow. We prewarm not just to get the data into LFC, but to have backends hit the LFC rather than having to access the pageserver.
> Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
That's what the LSN check is for.
> Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.
> Your `LFC_PREWARM_STARTED` flag serves the same goal as my `prewarm_started` flag. The main difference is that in your approach the backend will wait for prewarm completion while in my case it just skips this entry.
It seems to me that the main difference is that in your case the backend will not write out its page, while in my case it will (after waiting for the prefetch worker). Or do I misunderstand the description?
Note that the design I described only covers how we can merge prewarmed chunks' data into the LFC; it doesn't cover when and how we decide which chunks to prewarm. That'd probably be something along the lines of:
- Get next chunk from prewarm state
- If the chunk has no current entry in the LFC, and there are no empty chunks left, exit.
- If the chunk is already in LFC, remove the pages that are already in the LFC from the list of to-be-fetched pages for this chunk.
- Fetch the pages for this chunk, keeping notes on the LSN bounds used to fetch the pages
- Merge the pages into the chunk (see the algorithm details in the earlier comment on #9587) while keeping the noted-down LwLSN bounds in mind.
These steps can be executed in a superscalar/pipelined manner to maximize IO performance in the prefetching process(es).
> I'm not sure I follow. We prewarm not just to get the data into LFC, but to have backends hit the LFC rather than having to access the pageserver.
Sorry that I was unclear.
So, there can be three different situations:
- OLTP workload with a small number of clients
- OLTP workload with a large number of clients
- OLAP workload
In case 3) prewarm is not needed - OLAP queries do it much more efficiently themselves.
My experiments show that in case 2) prewarm is also not needed: a large number of parallel backends can do prewarm much faster than a separate prewarm worker.
So actually we need prewarm only in case 1 - where there are just a few backends which perform OLTP queries (index scans). If the database is large, it may take quite a long time for these few backends to access the whole working set, i.e. fill the LFC with data. All this time these backends, and so the client, will suffer from large delays (because of `get page` latency). Only in this case do I expect prewarm to be useful. But as far as there is a small number of backends and a large database, the chances that these backends will conflict with prewarm are very small.
This is why spending much effort on more efficient handling of these conflicts seems to be a waste of time.
> > Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
>
> That's what the LSN check is for.

Storing 8 bytes per page (not per 1Mb chunk!) seems to be too expensive: with a 1TB LFC we will have 128Mb just for storing LSNs. It is the same as what we now have for shared buffers.

> > Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
>
> Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.

It is not quite clear to me how it can work. So the prewarmer gets a page to be prewarmed, gets its LwLSN, marks it as "prewarm started", releases the lock and starts waiting for a response from PS. At the same moment some backend can write this page, then this page is thrown away and then once again retrieved by some backend. This backend will see the "prewarming" flag and wait until prewarm is completed. Ok, prewarm is completed. How to check now that the loaded version is what we need? We should remember the request LSN somewhere... but where? Or the prewarmer can try to check it itself by once again checking the LwLSN. But that is once again racy, because there is actually no synchronization between backends and the prewarmer.
I have discussed this problem with Heikki. He first suggested performing prewarming through shared buffers - in this case we have proper locking. But I do not want prewarm to affect shared buffers, especially taking into account that they are small. And once again - if the probability of conflict is very small, it is enough to have a quite conservative check here and there is no need to involve condition variables, LwLSN, ...
LwLSN checks are also not so trivial, taking into account that the cache is limited in size (entries can be evicted).
> So, there can be three different situations:
>
> - OLTP workload with a small number of clients
> - OLTP workload with a large number of clients
> - OLAP workload
>
> In case 3) prewarm is not needed - OLAP queries do it much more efficiently themselves.
I only partially agree here - indexes are slow to prewarm and I'd rather have them in memory by the time they're about to get hit.
> My experiments show that in case 2) prewarm is also not needed: a large number of parallel backends can do prewarm much faster than a separate prewarm worker.
I'm not sure about that. Your tests with prewarming have so far been using pg_prewarm, with whole relations at a time. That's not exactly representative of loading the previous compute's LFC chunks in order of most recent usage (inverse LRU), concurrent to the active workload.
Furthermore, so far testing has used pgbench in normal mode. While pgbench is one tool for benchmarking performance, it's not exactly comparable to join-heavy OLTP workloads, which are much more sensitive to IO in index lookup paths; and not very comparable to session-based workloads that often have some time between a commit and the next transaction starting. A rate-based (rather than throughput-based) pgbench run might show more favorable results.
> So actually we need prewarm only in case 1 - where there are just a few backends which perform OLTP queries (index scans). If the database is large, it may take quite a long time for these few backends to access the whole working set, i.e. fill the LFC with data. All this time these backends, and so the client, will suffer from large delays (because of `get page` latency). Only in this case do I expect prewarm to be useful. But as far as there is a small number of backends and a large database, the chances that these backends will conflict with prewarm are very small. This is why spending much effort on more efficient handling of these conflicts seems to be a waste of time.
With many clients it's quite likely the prewarmer was busy fetching pages that weren't yet in LFC, and which are useful to other sessions. Explicitly dropping those pages sounds like a waste of time and resources, given that any backend that might further request that page is now going to have to request it again with your design, and I think that's a larger waste of resources.
> > > Also, the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarm block load completes - if we cannot use the result in any case.
> >
> > That's what the LSN check is for.
>
> Storing 8 bytes per page (not per 1Mb chunk!) seems to be too expensive: with a 1TB LFC we will have 128Mb just for storing LSNs. It is the same as what we now have for shared buffers.
We don't store LSNs for pages in the LFC? We only have an LwLSN cache inside PostgreSQL, which we'll keep in mind when we fetch a chunk's pages. We don't have to store that data anywhere else; it's only used to synchronize backends so that we can't miss concurrent page updates for pages that get read; modified; flushed; evicted from LFC in the time it takes us to prefetch that page (which is extremely unlikely, but possible).
> > > Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.
> >
> > Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.
>
> It is not quite clear to me how it can work. So the prewarmer gets a page to be prewarmed, gets its LwLSN, marks it as "prewarm started", releases the lock and starts waiting for a response from PS. At the same moment some backend can write this page, then this page is thrown away and then once again retrieved by some backend. This backend will see the "prewarming" flag and wait until prewarm is completed. Ok, prewarm is completed. How to check now that the loaded version is what we need? We should remember the request LSN somewhere... but where? Or the prewarmer can try to check it itself by once again checking the LwLSN. But that is once again racy, because there is actually no synchronization between backends and the prewarmer.
Not exactly. What we'd do is:

1. Select a chunk to prewarm.
   This requires the chunk to be present in the LFC, or an empty chunk to be available in the LFC that we can then fill.
   If there are no empty chunks left, and the chunk we selected isn't present in the LFC, we skip this chunk, as evicting data to prewarm other data may be detrimental to the workload.
2. Unlink the selected chunk from the LRU.
   Now it can't be removed through LRU eviction, so we know we can read and write data into it without many issues.
3. Note down which blocks are already written to the chunk.
   Now we don't have to read those blocks, as they're already the latest version.
4. Unlock the LFC.
5. Get LwLSNs for all blocks we're about to prefetch.
6. Fetch the blocks with those LwLSNs from PS into local memory.
7. Start merging the data into the LFC entry (adapted from the earlier algorithm).
8. With exclusively locked LFC:
   - Mark chunk as LFC_PREWARM_STARTED
   - Copy prewarmer worker number (if relevant) into a new `prewarm_worker_field` on the chunk (this needs to be only a few bits, e.g. 8, as we will only have a few workers)
   - Unlink chunk from the LRU list, increasing refcount by 1
   - Copy refcount into the per-prewarmworker state's `wait_count`, reduced by 1 (self).
   - Release LFC lock.

   Now the chunk we're going to put data into is pinned (has been removed from the LRU, can't be reclaimed) and no new backends will start write-IO on pages that aren't present in the chunk, due to the LFC_PREWARM_STARTED flag.

9. Normal backends: writes for new pages that get to that chunk start waiting on the (new) condition variable `prewarm_progress` in the reported prewarm worker's worker chunk, until the LFC chunk doesn't have `LFC_PREWARM_STARTED` anymore.

10. Prewarmer:
    - wait on the new condition variable `chunkrelease` until the `wait_count` is empty (no more active concurrent writes)
    - Write out own data into pages that aren't yet present in the chunk
    - ~~Check LSNs for data~~ - can be skipped, because no page could have been modified in the meantime, as page loads and modifications must have written a page to the LFC
    - Set 'written' bit for valid data.
    - Unset LFC_PREWARM_STARTED
    - Signal condition variable `prewarm_progress`
    - Release and link chunk as normal.

Here, only concurrent writers, and no other processes, wait on the prewarmer while it is writing data into the chunk - readers can always read any data that's already present (but may also decide to wait for the prewarmer to complete if the chunk doesn't yet contain the page they're looking for).
> I have discussed this problem with Heikki. He first suggested performing prewarming through shared buffers - in this case we have proper locking.
The locking would be as safe as can be, yes.
> But I do not want prewarm to affect shared buffers, especially taking into account that they are small.
Agreed.
> And once again - if the probability of conflict is very small, it is enough to have a quite conservative check here and there is no need to involve condition variables, LwLSN, ...
>
> LwLSN checks are also not so trivial, taking into account that the cache is limited in size (entries can be evicted).
How and why would we be allowed to skip doing the LWLSN checks in the general case?
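For concreteness, a sketch of the per-page check being discussed; `lfc_lookup_lwlsn()` is a hypothetical stand-in for the last-written-LSN (LwLSN) cache lookup, and the predicate only illustrates the intended invariant:

```c
#include "access/xlogdefs.h"
#include "storage/buf_internals.h"

extern XLogRecPtr lfc_lookup_lwlsn(const BufferTag *tag);	/* hypothetical */

static bool
prewarm_page_still_valid(const BufferTag *tag, XLogRecPtr fetch_lsn)
{
	XLogRecPtr	lwlsn = lfc_lookup_lwlsn(tag);

	/*
	 * If the LwLSN advanced past the LSN used for the pageserver request,
	 * the page was modified (and possibly evicted again) while we were
	 * prefetching, so the fetched copy is stale and must not be installed
	 * in the LFC.
	 */
	return lwlsn <= fetch_lsn;
}
```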
Certainly it depends on the workload. But I made this conclusion based on the results of my experiments with prewarming using `pg_prewarm`.
I think that's wrong;
A page can be modified without being read: a backend may just rewrite the page with new content.
Sorry, I do not understand what is wrong. But yes, I agree that LFC prewarming could be smarter by downloading the most useful pages first. But in any case - with any heuristics and stored access counters - prewarm can only download pages which may be required, while a backend executing some query downloads pages which are actually required. So "natural prewarming" always has better efficiency than any "manual prewarming".
{
	hash = get_hash_value(lfc_hash, &fs[i].key);
	entry = hash_search_with_hash_value(lfc_hash, &fs[i].key, hash, HASH_ENTER, &found);
	/* Do not prewarm chunks which are already present in LFC */
	if (!found)
	{
		entry->offset = lfc_ctl->size++;
		entry->hash = hash;
		entry->access_count = 0;
		entry->prewarm_requested = true;
		entry->prewarm_started = false;
		memset(entry->bitmap, 0, sizeof entry->bitmap);
		/* Most recently visited pages are stored first */
		dlist_push_head(&lfc_ctl->lru, &entry->list_node);
		lfc_ctl->used += 1;
	}
}
This should still prewarm the pages that aren't present in LFC, while right now it would ignore prewarm([1,2,3]) if LFC state is loaded([4]).
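A rough sketch of the behaviour asked for here, reusing the entry and field names from the snippet above: when the chunk is already present, still request prewarm for the pages the chunk does not yet contain instead of skipping it. A real implementation would additionally have to remember which pages are missing (e.g. in an extra per-entry bitmap), so treat this helper as illustrative only:

```c
/*
 * Sketch only: mark an already-present chunk for prewarm if the saved state
 * lists pages that this chunk does not contain yet.
 */
static void
lfc_request_missing_pages(FileCacheEntry *entry, FileCacheStateEntry *fse)
{
	for (int i = 0; i < CHUNK_BITMAP_SIZE; i++)
	{
		/* pages listed in the saved state but not yet cached in this chunk */
		uint32		missing = fse->bitmap[i] & ~entry->bitmap[i];

		if (missing != 0)
		{
			entry->prewarm_requested = true;
			break;
		}
	}
}
```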
Sorry, maybe I do not understand you correctly. Do you mean that we will not prewarm the marked pages from a chunk if some other page from that chunk is already loaded?
Yes, that is true. And it is a consequence of chunk granularity. There are two reasons for using chunks larger than the page size:
- Reduce the size of the LFC hash in shared memory
- Improve access locality (and so seqscan speed) by storing subsequent pages together on disk
Because of 1) we have to detect conflicts between backends and prewarm at the chunk level.
But if the probability of a conflict is small (as assumed), it should not be a big problem.
pgxn/neon/file_cache.c
};
shard_no = get_shard_number(&fs[chunk_no].key);
while (!page_server->send(shard_no, (NeonRequest *) &request)
	   || !page_server->flush(shard_no))
I think the flush should only happen once we start waiting?
And this behaviour that circumvents the prefetch system is actually unsafe, as this will cause a hung backend in the following case:
- Start requesting pages [A, B, ...]:
  - Request page A
    Connection state: [GetPage A]
  - PS connection breaks
    Connection state: []
  - Request page B
    Connection state: [GetPage B]
- Start processing pages:
  - Wait for page A -> receives page B
    Connection state: []
  - Wait for page B -> hangs
  - ...
Thank you very much for pointing out the problem. I missed it.
I think that the problem is not in flush. We in any case have some sequence of prewarm page requests which were sent to PS and which are lost in case of a PS reconnect. We should perform cleanup as it is done now for prefetch.
> I think that the problem is not in flush

Correct, `flush` wasn't an issue by itself, but I did think it was slightly wasteful to `flush` on every `send` rather than only after all `send`s were written.
elog(LOG, "LFC: start loading %ld chunks", (long)n_entries);

while (true)
This doesn't lock the relation; and that might cause PS errors when the relation gets dropped concurrently.
Yes, we should handle (ignore) possible PS errors.
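One possible shape for "handle (ignore) possible PS errors", sketched with PostgreSQL's `PG_TRY`/`PG_CATCH` machinery; `prewarm_one_chunk()` is a hypothetical helper, and error-context/memory-context handling is elided:

```c
/*
 * Sketch only: ignore pageserver errors for a single chunk so that a
 * concurrently dropped relation does not abort the whole prewarm pass.
 * A real implementation would also switch back to the caller's memory
 * context in the catch block.
 */
PG_TRY();
{
	prewarm_one_chunk(&fs[chunk_no]);
}
PG_CATCH();
{
	/* Swallow the error (e.g. relation was dropped) and continue. */
	FlushErrorState();
}
PG_END_TRY();
```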
pgxn/neon/file_cache.c
if (entry->prewarm_started)
{
	/*
	 * Some page of this chunk is currently being written by `lfc_prewarm`.
	 * We should give up, so as not to interfere with it.
	 * But clearing the `prewarm_requested` flag also will not allow `lfc_prewarm` to fix its result.
	 */
	entry->prewarm_requested = false;
	LWLockRelease(lfc_lock);
	return;
}
I think this is unsound (or at least needs a clear warning here and in prefetch):
1. Have chunk C that contains written page P
2. Have prewarm start prefetching chunk C page Q
3. Have a backend write to chunk C's page P
4. Write is ignored because prefetch state flag was set.
5. Next read reads stale LFC entry.

If "1. Have chunk C that contains written page P" is impossible (due to only prefetching nonexistent/empty chunks) then we need a clear warning about this, and we'd have the bad-for-performance race condition:
1. Prefetch starts for chunk C
2. Backend writes modified page P for chunk C
3. Backend ignores LFC write because of prefetch state set
4. Backend reads page P for chunk C, has to go to PS instead of reading the recently-written modified page.
> Next read reads stale LFC entry.

Stale reads are certainly not acceptable.
Thank you for pointing it out. I need to clear the bits for all affected pages in this case.
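A minimal sketch of that fix, assuming `BLOCKS_PER_CHUNK` is a power of two and `blkno` is the block whose LFC write is being abandoned (a multi-block write would clear one bit per affected page):

```c
if (entry->prewarm_started)
{
	/* page index within the chunk; assumes BLOCKS_PER_CHUNK is a power of two */
	int			chunk_offs = blkno & (BLOCKS_PER_CHUNK - 1);

	/* clear the "present" bit so a later read cannot treat a stale copy as valid */
	entry->bitmap[chunk_offs >> 5] &= ~((uint32) 1 << (chunk_offs & 31));
	entry->prewarm_requested = false;
	LWLockRelease(lfc_lock);
	return;
}
```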
Force-pushed from a826a99 to 07027bd.
Problem
Lightweight version of PR #9197
Summary of changes
Now there are just two functions, for getting the LFC state and for prewarming.
The AUX mechanism is not used and no BGW is started.
Checklist before requesting a review
Checklist before merging