Use tokio locks in VirtualFile and turn with_file into macro #5247
Conversation
Sadly, Rust doesn't yet support (async) closures that take parameters with lifetimes.
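For context, a minimal sketch (not from the PR; assumes tokio with the `full` feature set) of the limitation: a closure can capture a borrow and return a future, but a closure *parameter* with a lifetime cannot flow into the returned future's type without boxing.

```rust
use std::future::Future;

// A generic helper that accepts a closure returning a future. This compiles
// because `Fut` does not have to name the lifetime of any closure parameter.
async fn run<F, Fut, T>(f: F) -> T
where
    F: FnOnce() -> Fut,
    Fut: Future<Output = T>,
{
    f().await
}

#[tokio::main]
async fn main() {
    let name = String::from("layer-000");
    // Capturing a borrow works: the async block borrows `name` from the
    // enclosing scope, so the future's lifetime is independent of any parameter.
    let len = run(|| async { name.len() }).await;
    println!("{len}");
    // What does NOT work (the limitation lamented above) is a bound like
    // `F: FnOnce(&File) -> Fut`: `Fut` cannot refer to the lifetime of the
    // `&File` parameter, so the data has to be captured, passed by value,
    // or the future has to be boxed behind a higher-ranked bound.
}
```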
1644 tests run: 1570 passed, 0 failed, 74 skipped (full report)
Code coverage (full report)
The comment gets automatically updated with the latest test results: e1518e2 at 2023-09-11T14:58:24.907Z ♻️
Another alternative to closures is to introduce a new guard object. Instead of passing a closure into `with_file`, the callers would obtain a guard and perform the operation on it directly. I guess in this case, the guard object would be just a thin wrapper around the lock guard for the fd-cache slot.
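For illustration, a hypothetical sketch of what such a guard could look like (names like `FileGuard`, `CachedFile`, and `lock_file` are made up, not the PR's actual API):

```rust
use std::fs::File;
use std::io;
use std::ops::Deref;
use std::sync::Arc;
use tokio::sync::{OwnedRwLockReadGuard, RwLock};

// Guard that keeps the fd-cache slot locked and derefs to the underlying file.
struct FileGuard {
    slot: OwnedRwLockReadGuard<File>,
}

impl Deref for FileGuard {
    type Target = File;
    fn deref(&self) -> &File {
        &*self.slot
    }
}

struct CachedFile {
    slot: Arc<RwLock<File>>,
}

impl CachedFile {
    // Holding the returned guard keeps the slot from being evicted or reused.
    async fn lock_file(&self) -> FileGuard {
        FileGuard { slot: Arc::clone(&self.slot).read_owned().await }
    }
}

#[tokio::main]
async fn main() -> io::Result<()> {
    let cached = CachedFile { slot: Arc::new(RwLock::new(File::open("/dev/null")?)) };
    // Caller-side shape: take the guard, then do the operation directly,
    // instead of wrapping the operation in a closure passed to `with_file`.
    let file = cached.lock_file().await;
    println!("len = {}", file.metadata()?.len());
    Ok(())
}
```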
The idea with the guards was really great; I've moved most of the logic away from the macro and into a function again that returns locks. There is still a macro to save repetition. I've also added back the metrics. In the worst case, if it turns out that we really can't use the macro with async ops, we can always remove it again: it's now pretty cheap to do so (there is still some minor repetition that it saves).
Adding request changes for this thread at least: https://github.com/neondatabase/neon/pull/5247/files#r1321117325
I still have issues with the tt-muncher and lack of Arc (unless I missed it) but now the metrics are back, and this needs to be rebased.
This is closer to what we had before. Again, sad that there is no support for async closures :).
Regarding the tt-muncher, I have now implemented your suggestion to not use the dot syntax any more. As for the lack of Arc, it's not really needed to use one here.
Ha, it worked! I'm okay with this now, even if the Arc changes might still be required. Also, since drop spawns a task, it should be good.
## Problem

Previously, we were using `observe_closure_duration` in the `VirtualFile` file opening code, but this doesn't support async open operations, which we want to use as part of #4743.

## Summary of changes

* Move the duration measurement from the `with_file` macro into an `observe_duration` macro.
* Some smaller drive-by fixes to replace the old strings with the new variant names introduced by #5273.

Part of #4743, follow-up of #5247.
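As a rough illustration (not the actual neon macro; `record_duration!` and `StorageIoTime` are made-up stand-ins for `observe_duration!` / `STORAGE_IO_TIME`), a macro can time a body that contains `.await`, which a closure-based `observe_closure_duration` cannot:

```rust
use std::time::Instant;

struct StorageIoTime;

impl StorageIoTime {
    fn observe(&self, op: &str, seconds: f64) {
        // The real code records into a Prometheus histogram; printing stands in here.
        println!("storage_io_time{{operation={op:?}}} += {seconds}s");
    }
}

macro_rules! record_duration {
    ($metric:expr, $op:expr, $body:expr) => {{
        let start = Instant::now();
        // Because this is a macro, `$body` may contain `.await`: the timing
        // code is expanded inline into the enclosing async fn.
        let result = $body;
        $metric.observe($op, start.elapsed().as_secs_f64());
        result
    }};
}

async fn open_file(path: &str) -> std::io::Result<tokio::fs::File> {
    let metric = StorageIoTime;
    record_duration!(metric, "open", tokio::fs::File::open(path).await)
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let _f = open_file("/dev/null").await?;
    Ok(())
}
```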
Motivation
==========

We observed two "indigestion" events on staging, each shortly after restarting `pageserver-0.eu-west-1.aws.neon.build`. It has ~8k tenants.

The indigestion manifests as `Timeline::get` calls failing with `exceeded evict iter limit`. The error is from `page_cache.rs`; it was unable to find a free page and hence failed with the error.

The indigestion events started occurring after we started deploying builds that contained the following commits:

```
[~/src/neon]: git log --oneline c0ed362..15eaf78
15eaf78 Disallow block_in_place and Handle::block_on (#5101)
a18d6d9 Make File opening in VirtualFile async-compatible (#5280)
76cc873 Use tokio locks in VirtualFile and turn with_file into macro (#5247)
```

The second and third commit are interesting. They add `.await` points to the VirtualFile code.

Background
==========

On the read path, which is the dominant user of the page cache & VirtualFile during pageserver restart, `Timeline::get`, `page_cache`, and VirtualFile interact as follows:

1. `Timeline::get` tries to read from a layer.
2. This read goes through the page cache.
3. If we have a page miss (which is known to be common after restart), page_cache uses `find_victim` to find an empty slot, and once it has found a slot, it gives exclusive ownership of it to the caller through a `PageWriteGuard`.
4. The caller is supposed to fill the write guard with data from the underlying backing store, i.e., the layer `VirtualFile`.
5. So, we call into `VirtualFile::read_at` to fill the write guard.

The `find_victim` method finds an empty slot using a basic implementation of the clock page replacement algorithm. Slots that are currently in use (`PageReadGuard` / `PageWriteGuard`) cannot become victims. If there have been too many iterations, `find_victim` gives up with error `exceeded evict iter limit`.

Root Cause For Indigestion
==========================

The second and third commit quoted in the "Motivation" section introduced `.await` points in the VirtualFile code. These enable tokio to preempt us and schedule another future __while__ we hold the `PageWriteGuard` and are calling `VirtualFile::read_at`. This was not possible before these commits, because there simply were no await points that weren't Poll::Ready immediately.

With the offending commits, there is now actual usage of `tokio::sync::RwLock` to protect the VirtualFile file descriptor cache. And we __know__ from other experiments that, during the post-restart "rush", the VirtualFile fd cache __is__ too small, i.e., all slots are taken by _ongoing_ VirtualFile operations and cannot be victims. So, assume that VirtualFile's `find_victim_slot`'s `RwLock::write().await` calls _will_ yield control to the executor.

The above can lead to a pathological situation if we have N runnable tokio tasks, each wanting to do `Timeline::get`, but only M slots, N >> M. Suppose M of the N tasks win a `PageWriteGuard` and get preempted at some `.await` point inside `VirtualFile::read_at`. Now suppose tokio schedules the remaining N-M tasks for fairness, then schedules the first M tasks again. Each of the N-M tasks will run `find_victim()` until it hits the `exceeded evict iter limit`. Why? Because the first M tasks took all the slots and are still holding them tight through their `PageWriteGuard`.

The result is massive wastage of CPU time in `find_victim()`. The effort to find a page is futile, but each of the N-M tasks still attempts it. This delays the time when tokio gets around to scheduling the first M tasks again. Eventually, tokio will schedule them, they will make progress, fill the `PageWriteGuard`, and release it. But in the meantime, the N-M tasks have already bailed with error `exceeded evict iter limit`. Eventually, higher-level mechanisms will retry for the N-M tasks, and this time, there won't be as many concurrent tasks wanting to do `Timeline::get`. So, it will shake out. But it's a massive indigestion until then.

This PR
=======

This PR reverts the offending commits until we find a proper fix.

```
Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)"
This reverts commit 76cc873.

Revert "Make File opening in VirtualFile async-compatible (#5280)"
This reverts commit a18d6d9.
```
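For readers unfamiliar with the page cache internals, a simplified sketch (not the actual `page_cache.rs` code; names and constants are illustrative) of the clock-replacement loop with an eviction iteration limit described above:

```rust
use std::sync::atomic::{AtomicBool, AtomicU8, Ordering};

struct Slot {
    pinned: AtomicBool, // stands in for an outstanding PageReadGuard/PageWriteGuard
    usage_count: AtomicU8,
}

const MAX_EVICT_ITERS: usize = 10;

fn find_victim(slots: &[Slot], clock_hand: &mut usize) -> Result<usize, &'static str> {
    let mut iters = 0;
    loop {
        iters += 1;
        if iters > MAX_EVICT_ITERS * slots.len() {
            // With M tasks holding PageWriteGuards across an .await point and
            // all slots pinned, every other task ends up here.
            return Err("exceeded evict iter limit");
        }
        let idx = *clock_hand % slots.len();
        *clock_hand += 1;
        let slot = &slots[idx];
        if slot.pinned.load(Ordering::Relaxed) {
            continue; // in use, cannot evict
        }
        // Second-chance logic: decrement the usage count; evict once it hits zero.
        let decrement = slot
            .usage_count
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |c| c.checked_sub(1));
        if decrement.is_err() {
            return Ok(idx);
        }
    }
}

fn main() {
    // All slots pinned: mirrors the post-restart rush where every slot is held
    // by an ongoing read, so the search gives up.
    let slots: Vec<Slot> = (0..4)
        .map(|_| Slot { pinned: AtomicBool::new(true), usage_count: AtomicU8::new(1) })
        .collect();
    let mut hand = 0;
    assert_eq!(find_victim(&slots, &mut hand), Err("exceeded evict iter limit"));
}
```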
## Problem

For #4743, we want to convert everything up to the actual I/O operations of `VirtualFile` to `async fn`.

## Summary of changes

This PR is the last change in a series of changes to `VirtualFile`: #5189, #5190, #5195, #5203, and #5224. It does the last preparations before the I/O operations are actually made async. We are doing the following things:

* Adopt an approach for the `VirtualFile` destructors similar to the one proposed by "page_cache: use try_write in drop_buffers_for_immutable" (#5122) for the page cache: use `try_write`. Similarly to the situation in the linked PR, one can argue that if we are in the destructor and the slot has not been reused yet, we are the only user accessing the slot, as we own the lock mutably. It is still possible that we fail to obtain the lock, but the only cause for that is the clock algorithm touching the slot, which should be quite an unlikely occurrence. For the case where `try_write` fails, we spawn an async task to destroy the lock. As just argued, however, the code path where we spawn the task should rarely be visited. (A minimal sketch of this pattern follows after the description.)
* Turn `with_file` into a macro part and a function part that contains most of the logic. The function part returns a lock object that the macro uses. The macro exists to perform the operation in a more compact fashion, saving the code that would otherwise put the lock into a variable and then run the operation while measuring its duration. We take the locks approach because Rust has no support for async closures. One can make normal closures return a future, but that approach runs into lifetime issues the moment you want to pass data to these closures via parameters that have a lifetime (captures work). For details, see this and this link. In "Make VirtualFile::{open, open_with_options, create, sync_all, with_file} async fn" (#5224), we ran into a similar problem with the `test_files` function, and we ended up passing the path and the `OpenOptions` by value instead of by reference, at the expense of a few extra copies. That was acceptable there because the data is cheaply copyable and we are in test code, but here we are not. And while `File::try_clone` exists, it issues system calls internally; it would also allocate an entirely new file descriptor, something that the fd cache was built to prevent.
* Adapt the `STORAGE_IO_TIME` metrics to support async.

Part of #4743.
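Below is a minimal sketch of the destructor pattern from the first bullet (hypothetical types; not the actual `VirtualFile` code), assuming a tokio `RwLock` around the fd-cache slot:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Default)]
struct SlotInner {
    // In the real code this would hold the cached file descriptor, if any.
    file: Option<std::fs::File>,
}

struct VirtualFileHandle {
    slot: Arc<RwLock<SlotInner>>,
}

impl Drop for VirtualFileHandle {
    fn drop(&mut self) {
        // Drop cannot .await, so first try to take the lock without blocking.
        match self.slot.try_write() {
            Ok(mut guard) => {
                // Common case: nobody else holds the slot, clear it synchronously.
                guard.file = None;
            }
            Err(_) => {
                // Rare case: someone (e.g. the eviction sweep) is touching the slot.
                // Defer the cleanup to a spawned task instead.
                let slot = Arc::clone(&self.slot);
                tokio::spawn(async move {
                    slot.write().await.file = None;
                });
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let handle = VirtualFileHandle { slot: Arc::new(RwLock::new(SlotInner::default())) };
    drop(handle); // exercises the try_write fast path
}
```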