
pageserver: more elegant cancellation for remote operations #6096

Closed
Tracked by #5585
jcsp opened this issue Dec 11, 2023 · 5 comments · Fixed by #6697
Labels
a/tech_debt (Area: related to tech debt) · c/storage/pageserver (Component: storage: pageserver)

Comments

@jcsp
Collaborator

jcsp commented Dec 11, 2023

  • Timeouts should be applied after acquiring the semaphore that limits concurrent remote operations. Applying the timeout anywhere else loses fairness: an operation that times out while waiting for the semaphore loses its place in the queue (see the sketch after this list).
  • Remote operations should return an error type that distinguishes actual errors from shutdown, so that callers can avoid logging (and metric-counting) shutdown as if it were an error.
  • `download_retry` should understand cancellation errors from the inner function; then we can avoid passing another cancellation token explicitly to `download_retry`, since it can infer shutdown just from the result of the wrapped function.
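A minimal sketch of the first two points; `download_with_limit`, `RemoteOpError`, and `do_download` are hypothetical names for illustration, not the pageserver's actual API:

```rust
use std::time::Duration;

use tokio::sync::Semaphore;
use tokio_util::sync::CancellationToken;

// Hypothetical error type that lets callers tell shutdown apart from real
// errors, so shutdown is not logged or metric-counted as a failure.
#[derive(Debug)]
enum RemoteOpError {
    Timeout,
    Cancelled,
    Other(anyhow::Error),
}

async fn download_with_limit(
    limiter: &Semaphore,
    timeout: Duration,
    cancel: &CancellationToken,
) -> Result<Vec<u8>, RemoteOpError> {
    // Waiting for a permit is cancellable but not subject to the timeout,
    // so an operation cannot time out (and lose its place in the queue)
    // merely because the semaphore is busy.
    let _permit = tokio::select! {
        permit = limiter.acquire() => permit.expect("semaphore is never closed"),
        _ = cancel.cancelled() => return Err(RemoteOpError::Cancelled),
    };
    // The timeout clock starts only now, with the permit held.
    tokio::select! {
        res = tokio::time::timeout(timeout, do_download()) => match res {
            Ok(Ok(bytes)) => Ok(bytes),
            Ok(Err(e)) => Err(RemoteOpError::Other(e)),
            Err(_elapsed) => Err(RemoteOpError::Timeout),
        },
        _ = cancel.cancelled() => Err(RemoteOpError::Cancelled),
    }
}

// Stand-in for the actual remote request.
async fn do_download() -> anyhow::Result<Vec<u8>> {
    Ok(Vec::new())
}
```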
@jcsp
Collaborator Author

jcsp commented Dec 11, 2023

Related PR for adding cancellation tokens to the remote storage interface: #4781

jcsp added a commit that referenced this issue Dec 15, 2023
## Problem

Various places in remote storage were not subject to a timeout (so
stuck TCP connections could hold things up), and did not respect a
cancellation token (so things like timeline deletion or tenant detach
could have to wait arbitrarily long).



## Summary of changes

- Add `download_cancellable` and `upload_cancellable` helpers, and use them
in all the places we wait for remote storage operations (with the
exception of initdb downloads, where it would not have been safe); a
rough sketch follows at the end of this commit message.
- Add a cancellation token arg to `download_retry`.
- Use cancellation token args in various places that were missing one
per #5066

Closes: #5066 

Why is this only "basic" handling?
- It doesn't express the difference between shutdown and errors in return
types, to avoid refactoring all the places that use an `anyhow::Error`
(these should all eventually return a more structured error type).
- It implements timeouts on top of remote storage, rather than within it:
this means that operations hitting their timeout will lose their
semaphore permit and thereby go to the back of the queue for their
retry.
- Doing a nicer job is tracked in #6096.
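For illustration, a rough sketch of what a `download_cancellable`-style wrapper can look like (hedged: `DownloadError` is simplified here, and this is not the actual helper from the PR):

```rust
use std::future::Future;

use tokio_util::sync::CancellationToken;

// Simplified error type for the sketch; the real code distinguishes
// more cases.
#[derive(Debug)]
enum DownloadError {
    Cancelled,
    Other(anyhow::Error),
}

// Race the inner download future against the cancellation token, so a
// tenant detach or timeline deletion does not have to wait for a stuck
// remote operation to finish.
async fn download_cancellable<F, T>(
    cancel: &CancellationToken,
    fut: F,
) -> Result<T, DownloadError>
where
    F: Future<Output = Result<T, DownloadError>>,
{
    tokio::select! {
        res = fut => res,
        _ = cancel.cancelled() => Err(DownloadError::Cancelled),
    }
}
```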
@koivunej koivunej self-assigned this Feb 5, 2024
@koivunej
Member

koivunej commented Feb 5, 2024

#6618 takes care of extra CancellationToken cloning. I looked into implementing this and it would turn:

```rust
let remote_storage_operation = GenericRemoteStorage::something(million, args, cancel).await;
```

into

```rust
// this is called from within backoff::retry

let new_nested_token = cancel.child_token();
let remote_storage_operation = GenericRemoteStorage::something(million, args, &new_nested_token);
// pin the future so it can be polled in select! and resumed afterwards
let mut remote_storage_operation = std::pin::pin!(remote_storage_operation);

tokio::select! {
    res = &mut remote_storage_operation => res,
    _ = tokio::time::sleep(UPLOAD_TIMEOUT) => {
        // on timeout, cancel the child token and let the operation
        // wind down gracefully before returning
        new_nested_token.cancel();
        remote_storage_operation.await
    },
}
```

Leaving us at +1 cancellation token creation after #6618 removes 1-2 clones, so it nets out to zero. I looked at whether there is a way to solve this at the LocalFS level by spawning, so that we could get the gracefulness and have only LocalFS pay for it; there isn't, really, since it would run into very much the same problem it can currently have.

The best alternative is to just add warning logging to LocalFS: if a test failure is caused by a cancellation (and a too-fast retry), it would pop out right away as something to consider, instead of being a very hard to reproduce problem.

Or do nothing and hope no one ever has to debug a test failure like this. The drop-guard logging seems easy to add.
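A minimal sketch of that drop-guard logging, with hypothetical names (`CancelWarnGuard`, the `upload` wrapper): if the operation's future is dropped before completing, the guard's `Drop` impl warns, so a cancellation plus a too-fast retry shows up in the test log immediately:

```rust
struct CancelWarnGuard {
    op: &'static str,
    completed: bool,
}

impl CancelWarnGuard {
    fn new(op: &'static str) -> Self {
        Self { op, completed: false }
    }

    // Call once the operation has run to completion.
    fn disarm(&mut self) {
        self.completed = true;
    }
}

impl Drop for CancelWarnGuard {
    fn drop(&mut self) {
        if !self.completed {
            tracing::warn!(
                "LocalFs {} dropped before completion; a concurrent retry may observe partial state",
                self.op
            );
        }
    }
}

async fn upload(bytes: &[u8], path: &std::path::Path) -> std::io::Result<()> {
    let mut guard = CancelWarnGuard::new("upload");
    tokio::fs::write(path, bytes).await?;
    guard.disarm();
    Ok(())
}
```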

@jcsp
Collaborator Author

jcsp commented Feb 5, 2024

> The best alternative is to just add warning logging to LocalFS: if a test failure is caused by a cancellation (and a too-fast retry), it would pop out right away as something to consider, instead of being a very hard to reproduce problem.

Sounds good, I agree with not letting LocalFS guide our design. It just has to work, not be graceful.

> Leaving us at +1 cancellation token creation after #6618 removes 1-2 clones, so it nets out to zero.

Yep. I like getting rid of the spurious clone in #6618. If the net result is ~similar overhead then we're good; it's fine to pay some clone-level efficiency tax for cancellation when the operation is a big remote HTTP request.

@jcsp
Collaborator Author

jcsp commented Feb 5, 2024

This week:

  • John & Joonas to sync on the scope

koivunej added a commit that referenced this issue Feb 6, 2024
The solution we ended up with for `backoff::retry` required always cloning
the cancellation token even when it is only `.await`ed. Fix that, and
also turn the return type into `Option<Result<T, E>>`, avoiding the need
for the `E::cancelled()` fn to be passed in.

Cc: #6096
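A hedged sketch of the resulting shape (not the exact code in #6618): the token is borrowed rather than cloned per attempt, and `None` means "cancelled", so no `E::cancelled()` constructor is needed:

```rust
use std::future::Future;
use std::time::Duration;

use tokio_util::sync::CancellationToken;

async fn retry<F, Fut, T, E>(
    mut op: F,
    is_permanent: impl Fn(&E) -> bool,
    max_attempts: u32,
    cancel: &CancellationToken,
) -> Option<Result<T, E>>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    for attempt in 1..=max_attempts {
        if cancel.is_cancelled() {
            return None; // cancelled is not an error, just stop
        }
        match op().await {
            Ok(v) => return Some(Ok(v)),
            Err(e) if is_permanent(&e) || attempt == max_attempts => {
                return Some(Err(e));
            }
            Err(_) => {
                // Back off before the next attempt; the real helper uses
                // exponential backoff. Stay responsive to cancellation.
                tokio::select! {
                    _ = tokio::time::sleep(Duration::from_millis(100)) => {}
                    _ = cancel.cancelled() => return None,
                }
            }
        }
    }
    None
}
```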
koivunej added a commit that referenced this issue Feb 6, 2024
Fix cloning the serialized heatmap on every attempt by turning it
into `bytes::Bytes` before the clone, so each clone is a refcount bump
instead of a full `Vec` copy that only gets refcounted later on.

Also fixes one cancellation token cloning I had missed in #6618.
Cc: #6096
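To illustrate the `bytes::Bytes` point (a standalone example, not the heatmap code itself): cloning `Bytes` bumps a reference count, while cloning a `Vec<u8>` copies the whole buffer, so converting once up front makes the per-attempt clone cheap:

```rust
use bytes::Bytes;

fn main() {
    // Stand-in for a serialized heatmap.
    let serialized: Vec<u8> = vec![0u8; 1024 * 1024];
    // One-time conversion; `Bytes::from(Vec<u8>)` takes ownership, no copy.
    let shared = Bytes::from(serialized);

    for _attempt in 0..3 {
        // Refcount bump instead of a megabyte memcpy per retry.
        let body = shared.clone();
        assert_eq!(body.len(), 1024 * 1024);
    }
}
```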
koivunej added a commit that referenced this issue Feb 9, 2024
…6696)

This PR is preliminary cleanups and refactoring around `remote_storage`
for the next PR, which will move the timeouts and cancellation into
`remote_storage`.

Summary:
- smaller drive-by fixes
- code simplification
- refactor common parts like `DownloadError::is_permanent`
- align the error types of `RemoteStorage::list_*` to make more use of the
`download_retry` helper

Cc: #6096
@jcsp
Collaborator Author

jcsp commented Feb 12, 2024

PR in flight: #6697

koivunej added a commit that referenced this issue Feb 14, 2024
Cancellation and timeouts are handled at remote_storage call sites, if
they are handled at all. They should always be handled, because we've had
transient problems with remote storage connections.

- Add a cancellation token to the `trait RemoteStorage` methods
- For the `download*` and `list*` methods there is
`DownloadError::{Cancelled,Timeout}`
- The rest keep using `anyhow::Error`; it will have root cause
`remote_storage::TimeoutOrCancel::{Cancel,Timeout}`
- Both types have an `is_permanent` equivalent, which should be passed to
`backoff::retry`
- New generic `RemoteStorageConfig` option `timeout`, defaulting to 120s
- Start counting the timeout only after acquiring the concurrency limiter
permit
- Cancellable permit acquisition
- Download stream timeout or cancellation is communicated via a
`std::io::Error`
- Exit `backoff::retry` by marking cancellation errors permanent

Fixes: #6096
Closes: #4781

Co-authored-by: arpad-m <[email protected]>
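A sketch of the error surface described above, with simplified definitions (the real types live in the `remote_storage` crate and carry more variants):

```rust
use std::fmt;

#[derive(Debug)]
enum DownloadError {
    Cancelled,
    Timeout,
    Other(anyhow::Error),
}

impl fmt::Display for DownloadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DownloadError::Cancelled => write!(f, "cancelled, shutting down"),
            DownloadError::Timeout => write!(f, "timeout"),
            DownloadError::Other(e) => write!(f, "{e:#}"),
        }
    }
}

impl std::error::Error for DownloadError {}

impl DownloadError {
    // Passed to backoff::retry: cancellation exits the retry loop.
    fn is_permanent(&self) -> bool {
        matches!(self, DownloadError::Cancelled)
    }
}

// Root cause embedded in anyhow::Error for the non-download methods.
#[derive(Debug)]
enum TimeoutOrCancel {
    Timeout,
    Cancel,
}

impl fmt::Display for TimeoutOrCancel {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TimeoutOrCancel::Timeout => write!(f, "timeout"),
            TimeoutOrCancel::Cancel => write!(f, "cancel"),
        }
    }
}

impl std::error::Error for TimeoutOrCancel {}

// The `is_permanent` equivalent for the anyhow-based methods.
fn is_permanent(err: &anyhow::Error) -> bool {
    err.root_cause()
        .downcast_ref::<TimeoutOrCancel>()
        .map_or(false, |c| matches!(c, TimeoutOrCancel::Cancel))
}
```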
koivunej added a commit that referenced this issue Feb 21, 2024
As noticed in #6836, some occurrences of error conversion were missed in
#6697:
- a `std::io::Error` produced by `tokio::io::copy_buf` and containing a
`DownloadError` was turned into `DownloadError::Other`
- similarly for secondary downloader errors

These changes come at the cost of losing pathname context.

Cc: #6096