Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement LFC prewarm #9197

Closed
wants to merge 14 commits into from
Closed

Implement LFC prewarm #9197

wants to merge 14 commits into from

Conversation

knizhnik
Copy link
Contributor

@knizhnik knizhnik commented Sep 30, 2024

Problem

We periodically need to restart computes.
As far as performance of Neon critically depends on local caches (shared buffers or LFC), it is important to provide fast rewarm of restarted node.

Standard Postgres pg_prewarm extension provides autoprewarm mode in which is saves state of shared buffers and loads it after restart. But current autoscaling policy assumes small shared_buffers and large LFC. So we need to prewire LFC rather than shared_buffers.

Summary of changes

This PR implements the following:

  1. Add hook to checkpoint allowing to perform custom action during checkpoint.
  2. Save state of LFC as shutdown checkoint (metadata of neon.file_cache_prewarm_limit most frequently accessed chunks) in AUX file. This AUX file is automatically included in basebackup and loaded on startup
  3. Start background worker which loads pages of this chunks.

Possible concerns:

  1. AUX file size - limited by neon.file_cache_prewarm_limit
  2. Extra storage IO - LFC state is saved only once at shutdown
  3. Race conditions with LFC accessed by other backends - addressed using conservative policy using prewarm_requersted and prewarm_started flags to skip updates of LFC in case of possible risk of contention.

TODO:

  1. Use vector protocol for retrieving pages from PS once it will be available.
  2. Start multiple background workers to prewarm LFC in parallel

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners September 30, 2024 06:38
Copy link

github-actions bot commented Sep 30, 2024

5265 tests run: 5050 passed, 0 failed, 215 skipped (full report)


Flaky tests (1)

Postgres 17

Code coverage* (full report)

  • functions: 31.3% (7692 of 24546 functions)
  • lines: 48.8% (60471 of 123956 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
8852e8a at 2024-10-27T17:12:02.966Z :recycle:

endpoint.stop()
endpoint.start()

time.sleep(5) # give prewarm BGW some time to proceed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best if we were able to wait for this to warming to complete, as this 5s might not be enough under a different test suite ordering. This would help with the variable load of test suite on runners.

Perhaps the most natural way would be an sql function we could select over?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that some extra API function is needed for waiting until LFC is prewarmed.
The main idea of prewarm is to perform it in background.
But I agree with you that it will be better to make test more reliable and less fluky.
This is why I replace 5 seconds sleep with the loop which checking number of prewired pages until it reaches specified threshold. Also it should reduce test time.

@knizhnik knizhnik mentioned this pull request Oct 23, 2024
5 tasks
@knizhnik
Copy link
Contributor Author

Replaced by #9587

@knizhnik knizhnik closed this Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants