
Implement replica prewarm #9466

Open

wants to merge 1 commit into main

Conversation

knizhnik
Contributor

Problem

The LFC has a critical impact on Neon node speed. After a node restart we need to prewarm the LFC to reach an acceptable level of performance. The idea is to start a replica and prewarm its LFC cache while the primary is still active and serving user requests. Once the LFC is prewarmed, we can stop the primary node and promote the replica to be the new primary.

This PR implements prewarming of the replica's LFC through the normal replication protocol. The primary periodically creates WAL records with the state of its LFC cache. By replaying these WAL records, the replica can load the corresponding pages and so maintain the same LFC state as the primary.

Summary of changes

I have added a new WAL record to the Neon RMGR: XLOG_NEON_LFC_PREWARM. This record contains information about LFC chunks. The number of chunks per record is limited by LFC_MAX_PREWARM_SIZE=1024 (hardcoded right now, but it can be changed to a GUC if needed). These records are produced with a period of neon.file_cache_prewarm_rate (msec). Each record includes the most recently accessed LFC chunks which have not yet been synced. I have added a synced flag to the LFC chunk entry; it is set when information about the chunk is sent to the replica and cleared when a new page is added to the chunk.

XLOG_NEON_LFC_PREWARM records are created by a background worker launched by the neon extension.
The PS is changed to ignore these records. Replay of this record is implemented in the neon_rmgr extension: it just loads the specified pages.

Prewarming is controlled by the neon.file_cache_prewarm_rate GUC, which can be changed at any moment (using pg_reload_conf). Setting it to 0 disables prewarming.

Known issues:

  1. This approach increases WAL size (but not storage size).
  2. Pages are loaded by the WAL receiver, so it slows down WAL apply on the replica.
  3. There is no limit on the number of prewarmed pages (a dedicated GUC can be added for it).
  4. If pages are frequently changed on the primary node, the same page can be requested for prewarm multiple times on the replica.
  5. It works only for PG16/17, because the Neon RMGR is not supported by earlier Postgres versions.
  6. No parallel prewarming.

In the future it can easily be changed to use vectored loads once they are supported by the SMGR protocol.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners October 21, 2024 14:29
@knizhnik knizhnik requested review from ololobus and skyzh October 21, 2024 14:29

5238 tests run: 5019 passed, 1 failed, 218 skipped (full report)


Failures on Postgres 17

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_replica_prewarm[debug-pg17]"

Test coverage report is not available

The comment gets automatically updated with the latest test results
9f1e2aa at 2024-10-21T15:14:23.062Z :recycle:

@@ -76,6 +80,9 @@ neon_rm_redo(XLogReaderState *record)
case XLOG_NEON_HEAP_MULTI_INSERT:
redo_neon_heap_multi_insert(record);
break;
case XLOG_NEON_LFC_PREWARM:
Member

This PR implements prewarming of the replica's LFC through the normal replication protocol.
The primary periodically creates WAL records with the state of its LFC cache. By replaying these WAL records, the replica can load the corresponding pages and so maintain the same LFC state as the primary.

Do you have a full picture in mind of how the whole system will work?

I have doubts that this is the simplest and most efficient way to go because:

  • Normally, I think it's reasonable to assume that primary and replica workloads are different. Well, that's 100% true because you do not have updates on the replica, but even the selects might be very different. Look at how Neon uses Postgres internally -- the primary serves the main production operational workload, APIs, etc.; the replica is used for backoffice UIs, background jobs, and other analytical kinds of things
  • Yes, preferably, we do need to prewarm the replica and then do a switchover, but that's only needed shortly before the restart, while with this PR we will be writing a lot of WAL all the time. It's probably possible to overcome this by setting this GUC only before we want to do a restart and disabling it shortly after the restart is finished
  • Restart, or more precisely 'scheduled graceful restart', is not the only problem; there are at least two others -- wake-up after scale-to-zero and unexpected/unscheduled/unorchestrated restart (e.g. a k8s node went down). In both cases we do want a prewarm as well

That said, I'm not sure that this approach will eventually help us solve all the cache prewarming problems on compute. Maybe developing the dump/restore API on compute + S3 persistence, as we discussed, is still the best way to go; then:

  1. The case of the scheduled restart could be covered by something orchestrating it: calling dump on the RO and restore on the RW
  2. An unexpected restart will be handled by the compute itself: it will request the cache content from S3 lazily at start and do the prewarming

In both cases, I assume that having slightly outdated caches is OK. While that's true for 2. (the compute will request the latest version of a page from storage; we just need to know that page N was in the cache before restart), in case 1. it might not be true: if the cache contains an older version of a page that is already updated on the primary, it might cause corruption-like problems after switchover, right? So we might still need something like what you propose here to handle case 1.

What do you think?

Contributor Author

I mostly try to address the case of restarting a huge node with an intensive workload. If the node is small or almost idle, then there is no big need for prewarming, or at least it can be done in another way.

So the scenario I have in mind wasn't actually my scenario - it was mentioned by Star at the very beginning of this discussion:

  • We have a running node which we want (need) to restart
  • We spawn a replica and prewarm it. There is no workload on the replica, so it is just prewarming. Once prewarming is completed (actually it is not so easy to determine when to stop prewarming if there is a stream of permanent updates on the primary), we stop the primary and promote the replica. We definitely expect that the new node will behave the same as the old one, so we try to load its LFC cache with the same content as the old node's.

What about the other scenarios you mentioned? I doubt that a huge node under intensive workload will ever be scaled to zero (terminated). But a crash of the node is definitely possible. In this case we can do what vanilla Postgres does: permanently maintain a hot-standby replica. The mechanism proposed in this PR allows keeping it in sync with the primary. It is not intended to be used for scaling read-only workload. It is just a standby, making it possible to perform a fast primary replacement at any moment.

I am not 100% sure that this is the best way to perform prewarming. And it is still not clear to me how critical prewarming is for us (how long it will take a huge node to reach its previous level of performance under high load after restart). If there is a large number of queries to the database, then maybe they can warm the cache much more efficiently than prewarming done by a single background worker.

In any case, there seem to be many concerns against using LFC for prewarming and other similar stuff. This PR illustrates how it is possible without that and without any other changes in Postgres. Should we follow this way? I don't know. But I really want to see the test results first.

@knizhnik
Contributor Author

knizhnik commented Oct 23, 2024

Some thoughts about LFC prewarming (maybe not the right place, but I do not know a better one):

  1. Prewarm of replica.
    One of the main arguments against using AUX for persisting the state of the LFC cache was that this mechanism can not be used on a replica (because a replica can not write WAL), so we can't prewarm a replica. But is it actually needed? Replicas are used for two purposes: hot standby and scaling read-only workload. In Neon (until now) it is not possible to promote a replica, so a replica can not be used for HA. Concerning load balancing and read-only queries: it is not so trivial to separate read-only and read-write workloads. Moreover, even with synchronous replication (which we do not currently support) there may still be some lag between primary and replica, i.e. if an application executes an update on the primary and then performs a select on the replica, it may not see its own changes! This can be addressed by changing the sync policy, but... the user can not do that himself. So replicas are mostly suitable for OLAP queries. And OLAP queries do not require prewarm: they efficiently warm the cache themselves. Or, if the dataset doesn't fit in memory, the cache can not be warmed at all.

  2. Race conditions during prewarming.
    Access to the LFC is currently protected by shared buffer pins/locks. There is a guarantee that if some backend is fetching something from the PS and storing it in the LFC, then no other backend can do it concurrently: all backends accessing the same page will wait until the buffer is released. So it is not possible that several backends are concurrently reading or writing the same LFC page. This is essential, because right now LFC file reads/writes are done without holding locks (to prevent blocking all LFC accesses if a syscall blocks).

But consider a straightforward implementation of prewarm: we somehow capture the LFC state and then start a background worker which loads these pages from the PS and stores them in the LFC. There is absolutely no guarantee that some other backend will not access and modify (if it is the primary) some page and write it to the LFC. So two or more backends can concurrently write different content to the same location of the LFC file.

How is it solved in #9197 and this PR?
In #9197 the race condition is detected using prewarm_requested and prewarm_started flags in the LFC entry state. Entries which should be fetched by the prewarm worker are marked with the prewarm_requested flag in the LFC hash entry. When some backend writes something to the LFC, it clears this flag, preventing prewarming of the entry. Also, if we find a page with the prewarm_started flag in the LFC cache, then lfc_read returns false (cache miss), so that we do not read incorrect page content. So synchronization in that PR is based on the assumption that pages can be changed only by this Postgres instance. Can it be extended to a replica, where changes are made by the primary and replayed by the walreceiver (if the target page is not present, then the corresponding LFC entry is invalidated using `lfc_evict`)? I think so, but `lfc_evict` should also be changed to take concurrent prewarm into account, and we should specify `replay_lsn` as the prewarm request LSN instead of the latest LSN.

And in this PR prewarming is performed by the walreceiver itself, so there are no race conditions at all... if the replica is not used for execution of read-only queries. This can be considered a serious limitation, but I do not think so. This is not a normal replica used for load balancing. This is a "special" replica temporarily started for the primary node switch-over. It should prewarm the LFC cache as fast as possible, restoring the state of the primary's LFC cache, and then it is promoted to primary.

  3. LSN in LFC.
    @ololobus proposed to store the LSN in the LFC metadata (in addition to the BufferTag). It can help to prevent some kinds of possible errors when we read stale content from the LFC cache, and maybe help to simplify prewarming - because based on the LSN we can understand whether a cache entry is up-to-date or not.
    I have several arguments against this proposal:
  • The LSN is not part of the SMGR API. It is taken from the last-written-LSN cache (lwLSN). If there are errors in lwLSN, then storing the LSN in the LFC can't somehow help to detect reading wrong content.
  • Storing an LSN for each page will significantly increase the size of the metadata. Right now the size of the metadata for one chunk (1MB) is 64 bytes, so a 1TB LFC cache needs a 64MB shared LFC hash table, which seems acceptable. But if we have to store an LSN (8 bytes) for each page, then for a 1TB LFC the size of the metadata will be 1GB. And for a 100GB LFC it will be 100MB - the same as all shared buffers. Is this acceptable? I am not sure... But what is more important, a stored LSN can not eliminate all errors and race conditions. There is still the problem that an LFC entry is read/written without any locks, and if two or more backends try to do it, we can get a mess of bytes instead of valid page content.

So what can we do?
We can extend #9197 to replica prewarming (by taking prewarm into account in lfc_evict) and grab the state of the LFC cache on demand (right now it is stored only on shutdown). Should we store the LFC state in S3 or somewhere else instead of AUX? I am not sure: I prefer to use a single mechanism for persisting all entries. And the fact that we can not capture the LFC state on a replica seems not to be so critical...

This PR covers just one prewarm scenario: planned node restart. It can not be used for prewarming a node after a crash or scale to zero. But it doesn't require AUX or S3.
