
Add functions to get LFC state, prewarm LFC and monitor prewarm process #9587

Open
knizhnik wants to merge 10 commits into main

Conversation

knizhnik
Contributor

Problem

Lightweight version of PR #9197

Summary of changes

Now there are just two functions: one for getting the LFC state and one for prewarming.
The AUX mechanism is not used and no BGW is started.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners October 31, 2024 08:22
@knizhnik knizhnik requested review from kelvich and skyzh October 31, 2024 08:22

github-actions bot commented Oct 31, 2024

7117 tests run: 6813 passed, 0 failed, 304 skipped (full report)


Flaky tests (1)

Postgres 15

Code coverage* (full report)

  • functions: 31.2% (8399 of 26877 functions)
  • lines: 48.0% (66679 of 139059 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
cca517f at 2024-12-23T10:36:36.236Z :recycle:

@knizhnik
Contributor Author

knizhnik commented Nov 19, 2024

First results of prewarming of large nodes.
Actually not so large, because I failed to create a node with an LFC size greater than 38GB.
Right now I have tested just pgbench scale 2000 (~30GB database).

And the results are very ... I don't know the proper word: not unexpected, because I expected something like this, but not to this extent; not so upsetting, because they demonstrate the importance of prewarm; and not exciting, because something is definitely wrong (most likely with the LFC).

So the whole database should fit in LFC.
I measured read-only performance using the following command:

  pgbench -c 100 -S -j 4 -M prepared  -T 1000 -P 1

It is launched on the compute node itself to minimize network latencies (it would be better to run it on some other node in the same cloud, but I do not yet know how to do that :)
Results after initialization are the following:

progress: 1.0 s, 96545.2 tps, lat 0.977 ms stddev 1.108, 0 failed
progress: 10.0 s, 122544.2 tps, lat 0.813 ms stddev 0.925, 0 failed
progress: 20.0 s, 140329.7 tps, lat 0.711 ms stddev 0.780, 0 failed
progress: 90.0 s, 143409.1 tps, lat 0.696 ms stddev 0.759, 0 failed
progress: 100.0 s, 144945.9 tps, lat 0.688 ms stddev 0.754, 0 failed
tps = 140099.783473 (without initial connection time)

After node restart performance is very bad and grows slowly:

progress: 1.0 s, 405.0 tps, lat 187.010 ms stddev 184.346, 0 failed
progress: 10.0 s, 22684.7 tps, lat 4.140 ms stddev 7.728, 0 failed
progress: 20.0 s, 3603.0 tps, lat 28.027 ms stddev 38.786, 0 failed
progress: 30.0 s, 29257.7 tps, lat 3.392 ms stddev 2.885, 0 failed
progress: 40.0 s, 3698.0 tps, lat 26.698 ms stddev 19.502, 0 failed
progress: 50.0 s, 5462.2 tps, lat 18.147 ms stddev 17.474, 0 failed
progress: 60.0 s, 34994.5 tps, lat 2.651 ms stddev 2.765, 0 failed
progress: 170.0 s, 13104.6 tps, lat 7.625 ms stddev 14.757, 0 failed
progress: 280.0 s, 25171.5 tps, lat 4.074 ms stddev 12.153, 0 failed
progress: 300.0 s, 72801.1 tps, lat 1.212 ms stddev 2.983, 0 failed
progress: 400.0 s, 88721.7 tps, lat 0.993 ms stddev 2.414, 0 failed
progress: 1000.1 s, 83554.8 tps, lat 1.017 ms stddev 2.460, 0 failed
tps = 74302.574857 (without initial connection time)

What is worse, it is not growing monotonically - there are peaks and falls.
After ~500 seconds it reaches a level of ~80k TPS and stays at this level until 1000 seconds.
Then I tried to perform a manual prewarm:

 explain (analyze,prefetch,buffers) select sum(abalance) from pgbench_accounts;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5778689.00..5778689.01 rows=1 width=8) (actual time=466585.455..466585.457 rows=1 loops=1)
   Buffers: shared read=3278689
   Prefetch: hits=3278590 misses=99 expired=0 duplicates=0
   ->  Seq Scan on pgbench_accounts  (cost=0.00..5278689.00 rows=200000000 width=4) (actual time=1.199..455158.039 rows=200000000 loops=1)
         Buffers: shared read=3278689
         Prefetch: hits=3278590 misses=99 expired=0 duplicates=0
 Planning:
   Buffers: shared hit=50 read=4 dirtied=2
 Planning Time: 12.401 ms
 Execution Time: 466585.510 ms

It took about 500 seconds to scan the whole table (with effective_io_concurrency=100), after which I once again ran pgbench; now it almost reaches the previous level, but still not immediately:

progress: 1.0 s, 1076.9 tps, lat 71.586 ms stddev 50.345, 0 failed
progress: 2.0 s, 40851.4 tps, lat 2.511 ms stddev 5.352, 0 failed
progress: 3.0 s, 70079.0 tps, lat 1.419 ms stddev 0.823, 0 failed
progress: 4.0 s, 80905.4 tps, lat 1.232 ms stddev 0.820, 0 failed
progress: 5.0 s, 82107.8 tps, lat 1.211 ms stddev 0.910, 0 failed
progress: 6.0 s, 42225.7 tps, lat 2.366 ms stddev 2.636, 0 failed
progress: 7.0 s, 91945.7 tps, lat 1.082 ms stddev 1.145, 0 failed
progress: 8.0 s, 97998.8 tps, lat 1.015 ms stddev 0.950, 0 failed
progress: 9.0 s, 98887.5 tps, lat 1.005 ms stddev 0.996, 0 failed
progress: 10.0 s, 99914.3 tps, lat 0.995 ms stddev 1.211, 0 failed
progress: 20.0 s, 122603.2 tps, lat 0.810 ms stddev 1.240, 0 failed
progress: 30.0 s, 138502.4 tps, lat 0.719 ms stddev 0.835, 0 failed
progress: 100.0 s, 134946.8 tps, lat 0.732 ms stddev 0.849, 0 failed
tps = 125702.532030 (without initial connection time)

@knizhnik knizhnik enabled auto-merge November 30, 2024 20:38
@knizhnik knizhnik disabled auto-merge November 30, 2024 20:40
Contributor

@MMeent MMeent left a comment

Some comments on only the re-loading part, not (yet) the state extraction.

Comment on lines +108 to +110
uint32 access_count : 30;
uint32 prewarm_requested : 1; /* entry should be filled by prewarm */
uint32 prewarm_started : 1; /* chunk is written by lfc_prewarm */
Contributor

I think I'd rather have this combination of bitfields be a separate struct, or a single uint32 accessed through macros.

Additionally, prewarm_requested/prewarm_started are used like bools all around the new code, so those fields should really be bool-typed, not uint32.
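
For illustration, a minimal sketch of the single-uint32 variant; the `state` field and macro names are invented for this example, not taken from the PR:

/* Hypothetical packed layout: low 30 bits = access count, top 2 bits = prewarm flags */
#define LFC_ACCESS_COUNT_MASK	0x3FFFFFFFu
#define LFC_PREWARM_REQUESTED	0x40000000u
#define LFC_PREWARM_STARTED		0x80000000u

#define LFC_GET_ACCESS_COUNT(e)		((e)->state & LFC_ACCESS_COUNT_MASK)
#define LFC_SET_ACCESS_COUNT(e, v)	((e)->state = ((e)->state & ~LFC_ACCESS_COUNT_MASK) | ((v) & LFC_ACCESS_COUNT_MASK))
#define LFC_TEST_FLAG(e, f)			(((e)->state & (f)) != 0)
#define LFC_SET_FLAG(e, f)			((e)->state |= (f))
#define LFC_CLEAR_FLAG(e, f)		((e)->state &= ~(f))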

Contributor Author

Sorry, I do not completely understand your arguments here.
Yes, it is possible to move these bits to a separate array, but why? 30 bits is more than enough for access_count - there will never be a billion backends, so these bits would just be wasted. And the size of this structure is quite critical, because it is multiplied by the maximal possible LFC size. Certainly it is possible to make it more compact, for example by replacing pointers with indexes in the LRU list. But I still do not see any good reason to replace the bit fields with bool.

Also I completely do not understand your concern about using a 1-bit field as a bool. It is quite common practice in C and in Postgres code as well. I do not see anything dangerous, inconvenient or confusing here.

@@ -149,7 +168,7 @@ lfc_disable(char const *op)
{
int fd;

elog(WARNING, "Failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);
elog(WARNING, "LFC: failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);
Contributor

Maybe we could use an lfc_elog() macro which adds the "LFC: " string for you?
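
A possible shape for such a macro, just as a sketch (the name lfc_elog is the suggestion above, not existing code); it relies on string-literal concatenation with the format argument:

/* Prefix every LFC message with "LFC: "; the first variadic argument must be a string literal */
#define lfc_elog(level, ...) elog(level, "LFC: " __VA_ARGS__)

/* e.g. the call site above would become */
lfc_elog(WARNING, "failed to %s local file cache at %s: %m, disabling local file cache", op, lfc_path);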

Contributor Author

Makes sense - will do it.

Comment on lines +138 to +142
typedef struct FileCacheStateEntry
{
BufferTag key;
uint32 bitmap[CHUNK_BITMAP_SIZE];
} FileCacheStateEntry;
Contributor

IIUC, this is the entry stored to describe "please prefetch these blocks", right?

I'm not sure about this layout. Why shouldn't we just prefetch the whole chunk starting at BufferTag.key?

Contributor Author

BufferTag is 20 bytes. CHUNK_BITMAP_SIZE with a 1MB chunk needs 16 bytes, so it increases the size of FileCacheStateEntry by less than a factor of two. But here size seems not to be so important. Reducing the amount of downloaded pages does seem to be important, though, especially for OLTP kinds of workload. In principle, a smart loader could check the number of set bits in this bitmap and either download the whole chunk or only individual pages; right now this is not implemented (see the sketch below).
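
A possible shape for that heuristic, only as a sketch: count the set bits in the chunk's bitmap and fetch the whole chunk when most of its pages are wanted. FileCacheStateEntry and CHUNK_BITMAP_SIZE come from the diff above; the 75% threshold and the helper name are made up, and pg_popcount32 is from port/pg_bitutils.h.

#include "port/pg_bitutils.h"

/*
 * Sketch (not part of the PR): decide whether to request the whole chunk
 * or only the individual pages marked in the bitmap.
 */
static bool
prewarm_fetch_whole_chunk(FileCacheStateEntry const *fs)
{
	int		set_bits = 0;

	for (int i = 0; i < CHUNK_BITMAP_SIZE; i++)
		set_bits += pg_popcount32(fs->bitmap[i]);

	/* the bitmap covers CHUNK_BITMAP_SIZE * 32 pages per chunk */
	return set_bits >= (CHUNK_BITMAP_SIZE * 32) * 3 / 4;
}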

Comment on lines +539 to +546
* But since the write to the LFC is performed without holding the lock, there is no guarantee that no such write is in progress.
* This is why a second flag is used: `prewarm_started`. It is set by `lfc_prewarm` when it starts writing a page and cleared when the write is completed.
* Any other backend writing to the LFC should abandon its write to the LFC file (just not mark the page as loaded in the bitmap) if this flag is set.
* So neither `lfc_prewarm` nor the backend saves the page in the LFC in this case - it is just skipped.
Contributor

Can't we do something like the following:

  1. Prewarmer: With exclusively locked LFC:

    1. Prewarmer: Get-or-insert chunk
    2. Prewarmer: Mark chunk as LFC_PREWARM_STARTED
    3. Prewarmer: copy prewarmer worker number (if relevant) into new prewarm_worker_field on chunk (this needs to be only a few bits, e.g. 8, as we only will have few workers),
    4. Prewarmer: unlink chunk from LRU list, increasing refcount by 1
    5. Prewarmer: Copy refcount into per-prewarmworker state's 'wait_count', reduced by 1 (self).
    6. Release LFC lock.

    Now the chunk we're going to put data into is pinned (has been removed from the LRU, can't be reclaimed) and no new backends will start write-IO on the chunk due to the LFC_PREWARM_STARTED flag.

  2. Normal backends: New writes that get to that chunk start waiting on (new) condition variable prewarm_progress in the reported prewarm worker's worker chunk, until the lfc chunk doesn't have LFC_PREWARM_STARTED anymore.

  3. Prewarmer:

    1. wait on new condition variable chunkrelease until the wait_count is empty (no more active concurrent writes)
    2. Write out own data
    3. Check LSNs for data
    4. Set 'written' bit for valid data.
    5. Unset LFC_PREWARM_STARTED
    6. Signal condition variable prewarm_progress
    7. Release and link chunk as normal.

Meanwhile, a normal backend will do:

  1. Acquire a chunk, noting down whether LFC_PREWARM_STARTED is set for it in lpstarted.
  2. If I'm writing and lpstarted, wait on the worker's prewarm_progress Condition Variable until the LFC_PREWARM_STARTED flag is released
  3. Do my thing
  4. While releasing my chunk with exclusive lock, if not lpstarted, but the chunk now has LFC_PREWARM_STARTED set:
    1. Decrease the prewarm worker's wait_count
    2. If wait_count == 0, after releasing the LFC lock, signal the chunk's worker's chunkrelease condition variable.

I think this is robust against concurrent writes to the same block, as we make sure to never have normal writers and the LFC prewarmer write to the same chunk at the same time.
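
To make the backend-side wait in step 2 concrete, a rough sketch with PostgreSQL's condition-variable API; PrewarmWorkerState, the prewarm_progress field and the FileCacheEntry struct name are placeholders for the proposal above (not existing code), and the LFC locking needed around the flag re-check is omitted:

#include "pgstat.h"
#include "storage/condition_variable.h"

/* Sketch: a writing backend waits until the prewarmer has released the chunk */
static void
lfc_wait_for_prewarmer(FileCacheEntry *entry, PrewarmWorkerState *worker)
{
	ConditionVariablePrepareToSleep(&worker->prewarm_progress);
	while (entry->prewarm_started)	/* stands in for LFC_PREWARM_STARTED; needs locking in real code */
		ConditionVariableSleep(&worker->prewarm_progress, PG_WAIT_EXTENSION);
	ConditionVariableCancelSleep();
}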

Contributor Author

I thought about something like this (making the backend wait for completion of the block load by the prewarmer).
But I consider the probability of such a collision relatively small (otherwise, if most of the blocks loaded by prewarm are in any case accessed by backends, then we do not need prewarm at all).

Also the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarmer completes loading the block - if we cannot use it in any case.

Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.

Your LFC_PREWARM_STARTED flag serves the same goal as my prewarm_started flag. The main difference is that in your approach the backend will wait for prewarm completion, while in my case it just skips this entry. Certainly the second approach is much simpler, while your approach can in some cases provide better performance. But I do not think this difference will be noticeable, under the assumption that the probability of such a collision is small enough.

Contributor

(otherwise, if most of the blocks loaded by prewarm are in any case accessed by backends, then we do not need prewarm at all).

I'm not sure I follow. We prewarm not just to get the data into LFC, but to have backends hit the LFC rather than having to access the pageserver.

Also the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarmer completes loading the block - if we cannot use it in any case.

That's what the LSN check is for.

Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.

Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.

Your LFC_PREWARM_STARTED flag serves the same goal as my prewarm_started flag. The main difference is that in your approach the backend will wait for prewarm completion, while in my case it just skips this entry.

It seems to me that the main difference is that in your case the backend will not write out its page, while in my case it will (after waiting for the prefetch worker). Or do I misunderstand the description?


Note that the design I described is just on how we can merge prewarmed chunks' data into the LFC, it doesn't cover when and how we decide which chunks to prewarm. That'd probably be something along the lines of:

  1. Get next chunk from prewarm state
  2. If the chunk has no current entry in the LFC, and there are no empty chunks left, exit.
  3. If the chunk is already in LFC, remove the pages that are already in the LFC from the list of to-be-fetched pages for this chunk.
  4. Fetch the pages for this chunk, keeping notes on the LSN bounds used to fetch the pages
  5. Merge the pages into the chunk (see Add functions to get LFC state, prewarm LFC and monitor prewarm process #9587 (comment) for algorithm details) while keeping the noted-down LwLSN bounds in mind.

These steps can be executed in superscalar/pipelined manner to maximize IO performance in the prefetching process(es).
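
A sketch of step 3 of that outline - dropping pages the LFC already holds from a chunk's to-be-fetched set. It reuses the FileCacheStateEntry bitmap from the diff; FileCacheEntry, the helper name and pg_popcount32 (port/pg_bitutils.h, as in the earlier sketch) are assumptions:

/* Sketch: keep only pages that still need to be fetched; returns how many are left */
static int
prewarm_filter_present_pages(FileCacheStateEntry *fs, FileCacheEntry const *entry)
{
	int		remaining = 0;

	for (int i = 0; i < CHUNK_BITMAP_SIZE; i++)
	{
		fs->bitmap[i] &= ~entry->bitmap[i];	/* already in LFC -> nothing to fetch */
		remaining += pg_popcount32(fs->bitmap[i]);
	}
	return remaining;
}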

Contributor Author

I'm not sure I follow. We prewarm not just to get the data into LFC, but to have backends hit the LFC rather than having to access the pageserver.

Sorry that I was unclear.
So, there can be three different situations:

  1. OLTP workload with small number of clients
  2. OLTP workload with large number of clients
  3. OLAP workload

In case 3) prewarm is not needed - OLAP queries do it much more efficiently themselves.
My experiments show that in case 2) prewarm is also not needed: a large number of parallel backends can do prewarm much faster than a separate prewarm worker.

So we actually need prewarm only in case 1 - where there are just a few backends performing OLTP queries (index scans). If the database is large, it may take quite a long time for these few backends to access the whole working set and fill the LFC with data. All this time these backends, and so the client, will suffer from large delays (because of get-page latency). Only in this case do I expect prewarm to be useful. But since there is a small number of backends and a large database, the chances that these backends will conflict with prewarm are very small.

This is why spending much effort on more efficient handling of these conflicts seems to be a waste of time.

Also the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarmer completes loading the block - if we cannot use it in any case.

That's what the LSN check is for.

Storing 8 bytes per page (not per 1MB chunk!) seems to be too expensive: a 1TB LFC has 128M pages, so that is about 1GB just for storing LSNs - about the same as we now have for shared buffers.

Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.

Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.

Not quite clear to me how it can work. So the prewarmer gets a page to be prewarmed, gets its lwlsn, marks it as "prewarm started", releases the lock and starts waiting for the response from the PS. At the same moment some backend can write this page, then this page is thrown away and then once again retrieved by some backend. This backend will see the "prewarming" flag and wait until prewarm is completed. Ok, prewarm is completed. How do we check now that the loaded version is what we need? We should remember the request LSN somewhere ... but where? Or the prewarmer can try to check it itself by once again checking the lwlsn. But it is once again racy, because there is actually no synchronization between backends and the prewarmer.

I have discussed this problem with Heikki. He first suggested performing prewarming through shared buffers - in this case we have proper locking. But I do not want prewarm to affect shared buffers, especially taking into account that they are small. And once again - if the probability of a conflict is very small, it is enough to have a quite conservative check here and there is no need to involve condition variables, lwlsn, ...

LwLSN checks are also not so trivial, taking into account that the LwLSN cache is limited in size (entries can be evicted).

Contributor

So, there can be three different situations:

  1. OLTP workload with small number of clients
  2. OLTP workload with large number of clients
  3. OLAP workload

In case 3) prewarm is not needed - OLAP queries do it much more efficiently themselves.

I only partially agree here - indexes are slow to prewarm and I'd rather have them in memory by the time they're about to get hit.

My experiments show that in case 2) prewarm is also not needed: a large number of parallel backends can do prewarm much faster than a separate prewarm worker.

I'm not sure about that. Your tests with prewarming have so far been using pg_prewarm, with whole relations at a time. That's not exactly representative of loading the previous compute's LFC chunks in order of most recent usage (inverse LRU), concurrent to the active workload.

Furthermore, so far testing has used pgbench, in normal mode. While pgbench is one tool for benchmarking performance, it's not exactly comparable to joining OLTP workloads, which are much more sensitive to IO in index lookup paths; and not very comparable to session-based workloads that often have some time between commit and the next transaction starting. A rate-based (rather than throughput-based) pgbench run might show more favorable results.

So we actually need prewarm only in case 1 - where there are just a few backends performing OLTP queries (index scans). If the database is large, it may take quite a long time for these few backends to access the whole working set and fill the LFC with data. All this time these backends, and so the client, will suffer from large delays (because of get-page latency). Only in this case do I expect prewarm to be useful. But since there is a small number of backends and a large database, the chances that these backends will conflict with prewarm are very small.

This is why spending much effort on more efficient handling of these conflicts seems to be a waste of time.

With many clients it's quite likely the prewarmer was busy fetching pages that weren't yet in LFC, and which are useful to other sessions. Explicitly dropping those pages sounds like a waste of time and resources, given that any backend that might further request that page is now going to have to request it again with your design, and I think that's a larger waste of resources.

Also the prewarmer and the backend can use different LSNs. So there may be no sense in waiting until the prewarmer completes loading the block - if we cannot use it in any case.

That's what the LSN check is for.

Storing 8 bytes per page (not per 1MB chunk!) seems to be too expensive: a 1TB LFC has 128M pages, so that is about 1GB just for storing LSNs - about the same as we now have for shared buffers.

We don't store LSNs for pages in the LFC? We only have an LwLSN cache inside PostgreSQL, which we'll keep in mind when we fetch a chunk's pages. We don't have to store that data anywhere else; it's only used to synchronize backends so that we can't miss concurrent page updates for pages that get read; modified; flushed; evicted from LFC in the time it takes us to prefetch that page (which is extremely unlikely, but possible).

Your step iii ("Check LSNs for data") assumes that we need to store an LSN for each block. It would add an extra 8 bytes to each entry. I already wrote that the size of the LFC hash entry seems to be quite critical, so I prefer not to extend its size.

Apologies that I didn't make this clearer, but what I meant there is that the prewarmer should check there that the LwLsn for each prefetched page hasn't been invalidated yet. No need to store an LSN on the chunk itself, as we can pull that from the LwLsn cache.

Not quite clear to me how it can work. So the prewarmer gets a page to be prewarmed, gets its lwlsn, marks it as "prewarm started", releases the lock and starts waiting for the response from the PS. At the same moment some backend can write this page, then this page is thrown away and then once again retrieved by some backend. This backend will see the "prewarming" flag and wait until prewarm is completed. Ok, prewarm is completed. How do we check now that the loaded version is what we need? We should remember the request LSN somewhere ... but where? Or the prewarmer can try to check it itself by once again checking the lwlsn. But it is once again racy, because there is actually no synchronization between backends and the prewarmer.

Not exactly. What we'd do is

  1. Select a chunk to prewarm.
    This requires the chunk to be present in LFC, or an empty chunk to be available in the LFC that we can then fill.
    If there are no empty chunks left, and the chunk we selected isn't present in the LFC, we skip this chunk as evicting data to prewarm other data may be detrimental to the workload.

  2. Unlink the selected chunk from the LRU
    Now it can't be removed through LRU eviction, so we know we can read and write data into it without many issues.

  3. Note down which blocks are already written to the chunk
    Now we don't have to read those blocks, as they're already the latest version

  4. Unlock the LFC

  5. Get LwLSNs for all blocks we're about to prefetch

  6. Fetch the blocks with those LwLSNs from PS into local memory

  7. Start merging the data into the LFC entry (adapted from earlier algo)

  8. With exclusively locked LFC:

    1. Mark chunk as LFC_PREWARM_STARTED
    2. Copy prewarmer worker number (if relevant) into new prewarm_worker_field on chunk (this needs to be only a few bits, e.g. 8, as we only will have few workers),
    3. unlink chunk from LRU list, increasing refcount by 1
    4. Copy refcount into per-prewarmworker state's 'wait_count', reduced by 1 (self).
    5. Release LFC lock.

    Now the chunk we're going to put data into is pinned (has been removed from the LRU, can't be reclaimed) and no new backends will start write-IO on pages that aren't present in the chunk due to the LFC_PREWARM_STARTED flag.

  9. Normal backends: Writes for new pages that get to that chunk start waiting on (new) condition variable prewarm_progress in the reported prewarm worker's worker chunk, until the lfc chunk doesn't have LFC_PREWARM_STARTED anymore.

  10. Prewarmer:

    1. wait on new condition variable chunkrelease until the wait_count is empty (no more active concurrent writes)
    2. Write out own data into pages that aren't yet present in the chunk
    3. Check LSNs for data can be ignored because no page could have been modified in the meantime, as page loads and modifications must have written a page to the LFC
    4. Set 'written' bit for valid data.
    5. Unset LFC_PREWARM_STARTED
    6. Signal condition variable prewarm_progress
    7. Release and link chunk as normal.

Here, only concurrent writers (and no other processes) wait on the prewarm process, and only while the prewarmer is writing data into the chunk - readers can always read any data that's already present (but may also decide to wait for the prewarmer to complete if the chunk doesn't yet contain the page they're looking for).
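
The prewarmer side of steps 10.i-10.vi might look roughly like this - again only a sketch against the proposed (not existing) wait_count / chunkrelease / prewarm_progress fields, with wait_count treated as an atomic counter and locking and error handling omitted:

#include "pgstat.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"

/* Sketch: prewarmer merges its data once all concurrent writers have finished */
static void
prewarm_finish_chunk(FileCacheEntry *entry, PrewarmWorkerState *worker)
{
	/* 10.i: wait until every writer that was active when we pinned the chunk is done */
	ConditionVariablePrepareToSleep(&worker->chunkrelease);
	while (pg_atomic_read_u32(&worker->wait_count) != 0)
		ConditionVariableSleep(&worker->chunkrelease, PG_WAIT_EXTENSION);
	ConditionVariableCancelSleep();

	/* 10.ii-10.iv: write out the prefetched pages and set their 'written' bits ... */

	/* 10.v-10.vi: let waiting backends proceed */
	entry->prewarm_started = false;		/* unset LFC_PREWARM_STARTED (under the LFC lock in real code) */
	ConditionVariableBroadcast(&worker->prewarm_progress);
}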

I have discussed this problem with Heikki. He first suggested performing prewarming through shared buffers - in this case we have proper locking.

The locking would be as safe as can be, yes.

But I do not want prewarm to affect shared buffers, especially taking into account that they are small.

Agreed.

And once again - if the probability of a conflict is very small, it is enough to have a quite conservative check here and there is no need to involve condition variables, lwlsn, ...

LwLSN checks are also not so trivial, taking into account that the LwLSN cache is limited in size (entries can be evicted).

How and why would we be allowed to skip doing the LWLSN checks in the general case?

@knizhnik
Contributor Author

knizhnik commented Dec 6, 2024

I only partially agree here - indexes are slow to prewarm and I'd rather have them in memory by the time they're about to get hit.

Certainly it depends on the workload. But I made this conclusion based on the results of my experiments with prewarming using pgbench -S: 100 clients with the standard read-only pgbench workload do the prewarming faster than a single prewarmer.
And they do just simple index scans with random keys (the worst case for prewarm).

@MMeent
Contributor

MMeent commented Dec 6, 2024

And they do just simple index scans with random keys (the worst case for prewarm).

I think that's wrong; pg_prewarm('relation') doesn't actually have the same behavior as autoprewarm or LFC prewarm, as pg_prewarm does a front-to-back load of the data rather than utilizing information about access frequency to load frequently-accessed pages first (which we can do if we keep dump and reload in inverse LRU order).

@knizhnik
Contributor Author

knizhnik commented Dec 6, 2024

Check LSNs for data can be ignored because no page could have been modified in the meantime, as page loads and modifications must have written a page to the LFC

A page can be modified without being read: a backend can just rewrite the page with new content.

@knizhnik
Contributor Author

knizhnik commented Dec 6, 2024

I think that's wrong

Sorry, I do not understand what is wrong. pgbench -S -c 100 completes prewarming faster than LFC prewarm or pg_prewarm. By "completes prewarming" I mean that pgbench starts reporting the same TPS as before shutdown. It is just a fact.

But yes, I agree that LFC can do prewarming smarter by downloading the most useful pages first. But in any case - with any heuristics and stored access counters - prewarm can only download pages which may be required, while a backend executing some query downloads pages which are actually required. So "natural prewarming" always has better efficiency than any "manual prewarming".

Comment on lines +593 to +613
{
hash = get_hash_value(lfc_hash, &fs[i].key);
entry = hash_search_with_hash_value(lfc_hash, &fs[i].key, hash, HASH_ENTER, &found);
/* Do not prewarm chunks which are already present in LFC */
if (!found)
{
entry->offset = lfc_ctl->size++;
entry->hash = hash;
entry->access_count = 0;
entry->prewarm_requested = true;
entry->prewarm_started = false;
memset(entry->bitmap, 0, sizeof entry->bitmap);
/* Most recently visited pages are stored first */
dlist_push_head(&lfc_ctl->lru, &entry->list_node);
lfc_ctl->used += 1;
}
}
Contributor

This should still prewarm the pages that aren't present in LFC, while right now it would ignore prewarm([1,2,3]) if LFC state is loaded([4]).
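
What that would mean for the loop above, sketched as an extra branch (an illustration of the suggestion only, not code from the PR):

else if (!entry->prewarm_started)
{
	/*
	 * Chunk already present: keep its loaded pages, but still ask the
	 * prewarmer to fill the pages whose bitmap bits are not set yet.
	 */
	entry->prewarm_requested = true;
}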

Contributor Author

Sorry, maybe I do not understand you correctly. Do you mean that we will not prewarm the marked pages from a chunk if some other page from that chunk is already loaded?
Yes, that is true, and it is a consequence of chunk granularity. There are two reasons for using chunks larger than the page size:

  1. Reduce the size of the LFC hash in shared memory
  2. Improve access locality = seqscan speed (storing subsequent pages together on disk).

Because of 1) we have to detect conflicts between backends and prewarm at chunk level.
But if the probability of a conflict is small (as is assumed), then it should not be a big problem.

};
shard_no = get_shard_number(&fs[chunk_no].key);
while (!page_server->send(shard_no, (NeonRequest *) &request)
|| !page_server->flush(shard_no))
Contributor

I think the flush should only happen once we start waiting?

And this behaviour that circumvents the prefetch system is actually unsafe, as this will cause a hung backend in the following case:

  1. Start requesting pages [A, B, ...]:
    1. Request page A
      Connection state: [GetPage A]
    2. PS connection breaks
      Connection state: []
    3. Request page B
      Connection state: [GetPage B]
  2. Start processing pages:
    1. Wait for page A -> receives page B
      Connection state: []
    2. Wait for page B -> hangs
      ...

Contributor Author

Thank you very much for pointing out the problem - I missed it.
I think the problem is not in the flush. In any case we have some sequence of prewarm page requests which were sent to the PS and which are lost in case of a PS reconnect. We should perform cleanup, as is done now for prefetch.

Contributor

@MMeent MMeent Dec 18, 2024

I think that the problem is not in flush

Correct, flush wasn't an issue by itself, but I did think it was slightly wasteful to flush on every send rather than only after all sends were written.
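
Roughly what that batching could look like with the page_server->send/flush calls from the quoted code; the requests array, n_entries and n_shards are illustrative names, and reconnect handling is only hinted at:

/* Sketch: queue all prewarm requests first, flush each shard once before waiting */
for (int chunk_no = 0; chunk_no < n_entries; chunk_no++)
{
	shard_no = get_shard_number(&fs[chunk_no].key);
	while (!page_server->send(shard_no, (NeonRequest *) &requests[chunk_no]))
	{
		/* reconnect / resend, as the prefetch code does */
	}
}
for (shard_no = 0; shard_no < n_shards; shard_no++)
	page_server->flush(shard_no);	/* one flush per shard, right before we start waiting */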


elog(LOG, "LFC: start loading %ld chunks", (long)n_entries);

while (true)
Contributor

This doesn't lock the relation; and that might cause PS errors when the relation gets dropped concurrently.

Contributor Author

Yes, we should handle (ignore) possible PS errors.

Comment on lines 1187 to 1213
if (entry->prewarm_started)
{
/*
* Some page of this chunk is currently being written by `lfc_prewarm`.
* We should give up so as not to interfere with it.
* But clearing the `prewarm_requested` flag also will not allow `lfc_prewarm` to fix its result.
*/
entry->prewarm_requested = false;
LWLockRelease(lfc_lock);
return;
}
Contributor

I think this is unsound (or at least needs a clear warning here and in prefetch):

  1. Have chunk C that contains written page P
  2. Have prewarm start prefetching chunk C page Q
  3. Have a backend write to chunk C's page P
  4. Write is ignored because prefetch state flag was set.
  5. Next read reads stale LFC entry.

If "1. Have chunk C that contains written page P" is impossible (due to only prefetching nonexistent/empty chunks) then we need a clear warning about this, and we'd have the bad-for-performance race condition:

  1. Prefetch starts for chunk C
  2. Backend writes modified page P for chunk C
  3. Backend ignores LFC write because of prefetch state set
  4. Backend reads page P for chunk C, has to go to PS instead of reading recently-written modified page.

Contributor Author

Next read reads stale LFC entry.

Stale reads are certainly not acceptable.
Thank you for pointing it out. I need to clear the bits for all affected pages in this case.
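
A sketch of that fix: when the backend skips its write because prewarm_started is set, it also clears the bitmap bits of the pages it was about to write, so later reads cannot hit a stale prewarmed copy. first_offs and nblocks are illustrative names for the written page range inside the chunk:

/* Sketch: invalidate the pages we failed to write so readers go to the PS instead */
for (int i = 0; i < nblocks; i++)
{
	int		chunk_offs = first_offs + i;	/* page index within the 1MB chunk */

	entry->bitmap[chunk_offs >> 5] &= ~((uint32) 1 << (chunk_offs & 31));
}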

@knizhnik knizhnik mentioned this pull request Dec 18, 2024