
Use tokio locks in VirtualFile and turn with_file into macro #5247

Merged
merged 11 commits into from
Sep 11, 2023

Conversation

arpad-m
Member

@arpad-m arpad-m commented Sep 7, 2023

Problem

For #4743, we want to convert everything up to the actual I/O operations of VirtualFile to async fn.

Summary of changes

This PR is the last change in a series of changes to VirtualFile: #5189, #5190, #5195, #5203, and #5224.

It does the last preparations before the I/O operations are actually made async. We are doing the following things:

  • First, we change the locks for the file descriptor cache to tokio's locks, whose guards are Send. This matters when one wants to hold locks across await points (which we want to do); otherwise the Future won't be Send. Also, one generally shouldn't block in async code, as executors don't tolerate that well.
  • Due to the lock change, we now take an approach for the VirtualFile destructors similar to the one proposed for the page cache in #5122 (page_cache: use try_write in drop_buffers_for_immutable): use try_write. As in the linked PR, one can argue that if we are in the destructor and the slot has not been reused yet, we are the only user accessing the slot, since we own the lock mutably. It is still possible that we fail to obtain the lock, but the only cause for that is the clock algorithm touching the slot, which should be a quite unlikely occurrence. If try_write does fail, we spawn an async task to destroy the lock. As just argued, though, the code path where we spawn the task should rarely be visited.
  • Lastly, we split with_file into a macro part and a function part that contains most of the logic. The function part returns a lock object, which the macro uses. The macro exists to perform the operation more compactly: it saves callers from putting the lock into a variable and then running the operation while measuring its duration. We take the lock-object approach because Rust has no support for async closures. One can make normal closures return a future, but that approach runs into lifetime issues the moment you want to pass data with a lifetime to these closures via parameters (captures work). For details, see this and this link. In #5224 (Make VirtualFile::{open, open_with_options, create, sync_all, with_file} async fn), we ran into a similar problem with the test_files function, and we ended up passing the path and the OpenOptions by value instead of by reference, at the expense of a few extra copies. That was acceptable because the data is cheaply copyable and we were in test code. Here, we are not, and while File::try_clone exists, it issues system calls internally. It would also allocate an entirely new file descriptor, the very thing the fd cache was built to prevent.
  • We change the STORAGE_IO_TIME metrics to support async.
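The try_write-in-Drop pattern from the second bullet can be sketched as follows. This is a minimal, hypothetical stand-in: the PR uses tokio::sync::RwLock and tokio::spawn, but std's RwLock and a thread keep the sketch dependency-free, and `Slot`/`VirtualFileHandle` are illustrative names, not the actual neon types.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical stand-in for an fd-cache slot.
pub struct Slot {
    pub file: Option<String>, // stands in for the cached OS file handle
}

pub struct VirtualFileHandle {
    pub slot: Arc<RwLock<Slot>>,
}

impl Drop for VirtualFileHandle {
    fn drop(&mut self) {
        match self.slot.try_write() {
            // Fast path: we are the only accessor of the slot, so we can
            // tear it down synchronously. This is the common case.
            Ok(mut guard) => {
                guard.file.take();
            }
            // Slow path (rare: e.g. the clock algorithm is inspecting the
            // slot): hand the cleanup off to a background task, since Drop
            // cannot await. With tokio this would be
            // tokio::spawn(async move { slot.write().await.file.take(); }).
            Err(_) => {
                let slot = Arc::clone(&self.slot);
                thread::spawn(move || {
                    slot.write().unwrap().file.take();
                });
            }
        }
    }
}
```

The key point is that the destructor never blocks: it either gets the lock immediately or defers the work.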

Part of #4743.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

Sadly, Rust doesn't yet support closures that take parameters with
lifetimes.
@arpad-m arpad-m requested review from a team as code owners September 7, 2023 23:46
@arpad-m arpad-m requested review from knizhnik and shanyp and removed request for a team September 7, 2023 23:46
@github-actions

github-actions bot commented Sep 7, 2023

1644 tests run: 1570 passed, 0 failed, 74 skipped (full report)


Code coverage (full report)

  • functions: 53.0% (7583 of 14311 functions)
  • lines: 81.5% (44787 of 54967 lines)

The comment gets automatically updated with the latest test results
e1518e2 at 2023-09-11T14:58:24.907Z :recycle:

@hlinnaka
Contributor

hlinnaka commented Sep 8, 2023

Lastly, we switch with_file to a macro. Mostly we do this because Rust has no support for async closures. One can make normal closures return a future, but that approach gets into lifetime issues the moment you want to pass data to these closures that has a lifetime.

Another alternative to closures is to introduce a new guard object. So instead of:

self.with_file("foo", |file| file.sync_all()).await?

The callers would look like this:

{
    let file_guard = self.with_file();
    file_guard.sync_all().await?
    // `file_guard` is dropped
}

I guess in this case, the guard object would be just slot_guard, or a wrapper around it.
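The guard-object idea above could be sketched like this (hypothetical names, not the actual neon code): with_file() resolves the cached file and returns a guard borrowing it, and the caller invokes the operation directly, so no closure is needed.

```rust
use std::ops::Deref;

pub struct VirtualFile {
    pub file: String, // stands in for the slot-cached std::fs::File
}

// Guard borrowing the file out of the (conceptual) cache slot; dropping it
// would release the slot again.
pub struct FileGuard<'a> {
    file: &'a String,
}

impl<'a> Deref for FileGuard<'a> {
    type Target = String;
    fn deref(&self) -> &String {
        self.file
    }
}

impl VirtualFile {
    // In the real code this would be an `async fn` that also reopens the
    // file if its slot was evicted.
    pub fn with_file(&self) -> FileGuard<'_> {
        FileGuard { file: &self.file }
    }
}
```

A caller then writes `let guard = vf.with_file();` followed by the operation on `guard`, and any `.await` happens in the caller rather than inside a closure, which sidesteps the async-closure lifetime problem.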

@arpad-m
Member Author

arpad-m commented Sep 9, 2023

The idea with the guards was really great; I've moved most of the logic out of the macro and into a function that returns locks. There is still a macro to save repetition. I've also added back the metrics.

In the worst case, if it turns out that we really can't use the macro with async ops, we can always remove it again: now it's pretty cheap to do so (there is still some minor level of repetition that it saves).

Member

@koivunej koivunej left a comment


Adding request changes for this thread at least: https://github.com/neondatabase/neon/pull/5247/files#r1321117325

@koivunej koivunej dismissed their stale review September 11, 2023 12:39

I still have issues with the tt-muncher and lack of Arc (unless I missed it) but now the metrics are back, and this needs to be rebased.

This is closer to what we had before. Again, sad that there is no
support for async closures :).
@arpad-m
Member Author

arpad-m commented Sep 11, 2023

I still have issues with the tt-muncher and lack of Arc (unless I missed it) but now the metrics are back, and this needs to be rebased.

Regarding the tt muncher, I have now implemented your suggestion to stop using the dot syntax. As for the lack of Arc: Arcs aren't really needed here; instead, I used lock objects as suggested by Heikki (that was already done by the time you made that comment). As for the rebase, I've done it via a git merge.

Member

@koivunej koivunej left a comment


Ha, it worked! I'm okay with this now, even if the Arc changes might still be required. Also, drop now spawns a task, so it should be good.

@arpad-m arpad-m merged commit 76cc873 into main Sep 11, 2023
30 checks passed
@arpad-m arpad-m deleted the arpad/virtual_file_async_6 branch September 11, 2023 15:35
arpad-m added a commit that referenced this pull request Sep 11, 2023
## Problem

Previously, we were using `observe_closure_duration` in `VirtualFile`
file opening code, but this doesn't support async open operations, which
we want to use as part of #4743.

## Summary of changes

* Move the duration measurement from the `with_file` macro into a
`observe_duration` macro.
* Some smaller drive-by fixes to replace the old strings with the new
variant names introduced by #5273

Part of #4743, follow-up of #5247.
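A duration-observing macro of the kind described above might look like the following minimal sketch. The name `observe_duration!` matches the commit message, but the body is a simplified, hypothetical version: real metrics would record into a histogram, and the wrapped expression may contain `.await` in async contexts, which is exactly what a closure-based `observe_closure_duration` cannot express.

```rust
use std::time::Instant;

// Times the wrapped expression and hands the elapsed time to `$metric`
// (here modeled as any callable taking a Duration), then yields the
// expression's result. Because this is a macro, `$op` may contain `.await`.
macro_rules! observe_duration {
    ($metric:expr, $op:expr) => {{
        let start = Instant::now();
        let res = $op;
        $metric(start.elapsed());
        res
    }};
}
```

Being a macro rather than a higher-order function is what lets the measured operation be an arbitrary async expression in the caller's own scope.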
problame added a commit that referenced this pull request Sep 12, 2023
problame added a commit that referenced this pull request Sep 12, 2023
Motivation
==========

We observed two "indigestion" events on staging, each shortly after
restarting `pageserver-0.eu-west-1.aws.neon.build`. It has ~8k tenants.

The indigestion manifests as `Timeline::get` calls failing with
`exceeded evict iter limit`.
The error is from `page_cache.rs`; it was unable to find a free page and
hence failed with the error.

The indigestion events started occurring after we started deploying
builds that contained the following commits:

```
[~/src/neon]: git log --oneline c0ed362..15eaf78
15eaf78 Disallow block_in_place and Handle::block_on (#5101)
a18d6d9 Make File opening in VirtualFile async-compatible (#5280)
76cc873 Use tokio locks in VirtualFile and turn with_file into macro (#5247)
```

The second and third commit are interesting.
They add .await points to the VirtualFile code.

Background
==========

On the read path, which is the dominant user of page cache & VirtualFile
during pageserver restart, `Timeline::get` `page_cache` and VirtualFile
interact as follows:

1. Timeline::get tries to read from a layer
2. This read goes through the page cache.
3. If we have a page miss (which is known to be common after restart),
page_cache uses `find_victim` to find an empty slot, and once it has
found a slot, it gives exclusive ownership of it to the caller through a
`PageWriteGuard`.
4. The caller is supposed to fill the write guard with data from the
underlying backing store, i.e., the layer `VirtualFile`.
5. So, we call into `VirtualFile::read_at` to fill the write guard.

The `find_victim` method finds an empty slot using a basic
implementation of clock page replacement algorithm.
Slots that are currently in use (`PageReadGuard` / `PageWriteGuard`)
cannot become victims.
If there have been too many iterations, `find_victim` gives up with
error `exceeded evict iter limit`.
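The clock behaviour described above can be sketched as follows; the structure and constant names are hypothetical simplifications of what `page_cache.rs` does, not the actual implementation.

```rust
// Minimal clock page replacement with an iteration cap.
pub struct Slot {
    pub pinned: bool, // held by a PageReadGuard/PageWriteGuard: cannot evict
    pub usage: u8,    // recent-use counter decremented by the clock hand
}

const MAX_EVICT_ITERS: usize = 10;

pub fn find_victim(slots: &mut [Slot], hand: &mut usize) -> Result<usize, &'static str> {
    for _ in 0..MAX_EVICT_ITERS * slots.len() {
        let i = *hand % slots.len();
        *hand += 1;
        let slot = &mut slots[i];
        if slot.pinned {
            continue; // in use, skip
        }
        if slot.usage > 0 {
            slot.usage -= 1; // give the page a second chance
        } else {
            return Ok(i); // victim found
        }
    }
    Err("exceeded evict iter limit")
}
```

Note the failure mode: if every slot stays pinned for the whole scan, the loop burns CPU and then bails, which is precisely what the indigestion analysis below describes.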

Root Cause For Indigestion
==========================

The second and third commit quoted in the "Motivation" section
introduced `.await` points in the VirtualFile code.
These enable tokio to preempt us and schedule another future __while__
we hold the `PageWriteGuard` and are calling `VirtualFile::read_at`.
This was not possible before these commits, because there simply were no
await points that weren't Poll::Ready immediately.
With the offending commits, there is now actual usage of
`tokio::sync::RwLock` to protect the VirtualFile file descriptor cache.
And we __know__ from other experiments that, during the post-restart
"rush", the VirtualFile fd cache __is__ too small, i.e., all slots are
taken by _ongoing_ VirtualFile operations and cannot be victims.
So, assume that VirtualFile's `find_victim_slot`'s
`RwLock::write().await` calls _will_ yield control to the executor.

The above can lead to a pathological situation if we have N runnable
tokio tasks, each wanting to do `Timeline::get`, but only M slots,
with N >> M.
Suppose M of the N tasks win a PageWriteGuard and get preempted at some
.await point inside `VirtualFile::read_at`.
Now suppose tokio schedules the remaining N-M tasks for fairness, then
schedules the first M tasks again.
Each of the N-M tasks will run `find_victim()` until it hits the
`exceeded evict iter limit`.
Why? Because the first M tasks took all the slots and are still holding
them tight through their `PageWriteGuard`.

The result is massive wastage of CPU time in `find_victim()`.
The effort to find a page is futile, but each of the N-M tasks still
attempts it.

This delays the time when tokio gets around to schedule the first M
tasks again.
Eventually, tokio will schedule them, they will make progress, fill the
`PageWriteGuard`, release it.
But in the meantime, the N-M tasks have already bailed with error
`exceeded evict iter limit`.

Eventually, higher level mechanisms will retry for the N-M tasks, and
this time, there won't be as many concurrent tasks wanting to do
`Timeline::get`.
So, it will shake out.

But, it's a massive indigestion until then.

This PR
=======

This PR reverts the offending commits until we find a proper fix.

```
    Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)"
    
    This reverts commit 76cc873.


    Revert "Make File opening in VirtualFile async-compatible (#5280)"
    
    This reverts commit a18d6d9.
```