move timeouts and cancellation handling to remote_storage #6697

Merged
merged 38 commits into from
Feb 14, 2024

koivunej
Member

@koivunej koivunej commented Feb 9, 2024

Cancellation and timeouts are currently handled at the remote_storage call sites, and only at some of them. They should always be handled, because we have had transient problems with remote storage connections.

  • Add cancellation token to the trait RemoteStorage methods
    • For the download* and list* methods there is DownloadError::{Cancelled,Timeout}
    • The remaining methods return anyhow::Error, whose root cause is remote_storage::TimeoutOrCancel::{Cancel,Timeout}
    • Both types have an ::is_permanent equivalent, which should be passed to backoff::retry
  • New generic RemoteStorageConfig option timeout, defaults to 120s
  • Start counting timeouts only after acquiring concurrency limiter permit
  • Cancellable permit acquiring
  • Download stream timeout or cancellation is communicated via an std::io::Error
  • Exit backoff::retry by marking cancellation errors permanent
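
The last point, leaving the retry loop as soon as an error is classified as permanent, can be sketched with the standard library alone. This is an illustration and not the code from this PR: the `DownloadError` enum below is a reduced stand-in, `retry` is a hypothetical synchronous helper whose signature does not match the real `backoff::retry`, and treating only `Cancelled` as permanent is an assumption made for the example.

```rust
/// Reduced stand-in for the error type named in the description (assumption:
/// the real enum has more variants and carries more context).
#[derive(Debug, PartialEq)]
enum DownloadError {
    Cancelled,
    Timeout,
    Other(String),
}

impl DownloadError {
    /// Assumption for this sketch: cancellation must never be retried,
    /// while timeouts and other failures may be transient.
    fn is_permanent(&self) -> bool {
        matches!(self, DownloadError::Cancelled)
    }
}

/// Hypothetical retry helper: calls `op` until it succeeds, until the error
/// is permanent, or until `max_attempts` calls have been made.
fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    is_permanent: impl Fn(&E) -> bool,
    max_attempts: u32,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(value) => return Ok(value),
            // Permanent errors (such as cancellation) end the loop immediately.
            Err(e) if is_permanent(&e) || attempt >= max_attempts => return Err(e),
            // A real implementation would sleep with exponential backoff here.
            Err(_) => continue,
        }
    }
}
```

With this shape, a call such as `retry(|| download(), DownloadError::is_permanent, 5)` (where `download` is any fallible closure) stops after a single attempt when the operation reports `Cancelled`, and keeps trying on `Timeout`.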

Fixes: #6096
Closes: #4781


github-actions bot commented Feb 9, 2024

2436 tests run: 2316 passed, 0 failed, 120 skipped (full report)


Flaky tests (4)

Postgres 15

  • test_create_snapshot: debug
  • test_sharding_split_unsharded: release

Postgres 14

  • test_sharding_split_unsharded: release
  • test_sharding_split_smoke: release

Code coverage (full report)

  • functions: 55.9% (12889 of 23067 functions)
  • lines: 82.4% (69922 of 84825 lines)

The comment gets automatically updated with the latest test results
31ade7f at 2024-02-14T21:01:48.911Z :recycle:

Base automatically changed from move_timeout_cancel2 to main February 9, 2024 12:53
@arpad-m arpad-m self-requested a review February 9, 2024 13:59
I am unsure why that was even there; we do not need Unpin but also I am
surprised it ever was.
it returns an error on the FIRST timeout or cancel, which can be ignored
to disable it.
as it will be used as the root cause of anyhow, we have no choice but to
do the pointer chasing on queries.
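
The "pointer chasing" mentioned in the commit message above can be pictured as walking an error's `source()` chain until a timeout-or-cancel marker is found. The following is a std-only sketch under stated assumptions: `TimeoutOrCancel` here is a local stand-in for the type named in the description, `WithContext` is a hypothetical wrapper playing the role of an `anyhow` context layer, and `find_timeout_or_cancel` is an illustrative helper, not a function from the PR.

```rust
use std::error::Error;
use std::fmt;

/// Local stand-in for the marker type named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeoutOrCancel {
    Timeout,
    Cancel,
}

impl fmt::Display for TimeoutOrCancel {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TimeoutOrCancel::Timeout => write!(f, "timeout"),
            TimeoutOrCancel::Cancel => write!(f, "cancelled"),
        }
    }
}

impl Error for TimeoutOrCancel {}

/// Hypothetical context wrapper: one link in an error chain.
#[derive(Debug)]
struct WithContext {
    context: String,
    source: Box<dyn Error + 'static>,
}

impl fmt::Display for WithContext {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.context)
    }
}

impl Error for WithContext {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Follows the source chain one link at a time and returns the first
/// `TimeoutOrCancel` found, if any.
fn find_timeout_or_cancel(err: &(dyn Error + 'static)) -> Option<TimeoutOrCancel> {
    let mut current = Some(err);
    while let Some(e) = current {
        if let Some(found) = e.downcast_ref::<TimeoutOrCancel>() {
            return Some(*found);
        }
        current = e.source();
    }
    None
}
```

The cost is one downcast attempt per link, which is why the query is linear in the depth of the chain rather than a single field check.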
@koivunej koivunej force-pushed the move_timeout_cancel3 branch 3 times, most recently from 1005167 to df74b57 Compare February 13, 2024 13:57
@koivunej koivunej force-pushed the move_timeout_cancel3 branch from e47e937 to eb2449b Compare February 13, 2024 15:09
@koivunej koivunej marked this pull request as ready for review February 13, 2024 15:54
@koivunej koivunej requested review from a team as code owners February 13, 2024 15:54
@koivunej koivunej force-pushed the move_timeout_cancel3 branch from eb2449b to a2eeb13 Compare February 14, 2024 08:33
@koivunej
Member Author

The latest force push got rid of the extra zst file I had accidentally added and missed in review mode.

@koivunej koivunej enabled auto-merge (squash) February 14, 2024 20:21
@koivunej koivunej merged commit 80854b9 into main Feb 14, 2024
49 checks passed
@koivunej koivunej deleted the move_timeout_cancel3 branch February 14, 2024 23:24
koivunej added a commit that referenced this pull request Feb 21, 2024
As noticed in #6836, some occurrences of error conversions were missed in
#6697:
- `std::io::Error` popped up by `tokio::io::copy_buf` containing
`DownloadError` was turned into `DownloadError::Other`
- similarly for secondary downloader errors

These changes come at the loss of pathname context.

Cc: #6096
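
The conversion problem described in this follow-up commit (a typed error travelling inside an `std::io::Error` and being flattened into a generic variant) can be illustrated with the standard library. This is a sketch under assumptions, not the fix from the commit: `DownloadError` is a reduced local stand-in, and `into_io_error` and `from_io_error` are hypothetical helper names.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Reduced stand-in for the download error type (assumption: the real
/// `Other` variant carries an error with context).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DownloadError {
    Cancelled,
    Timeout,
    Other,
}

impl fmt::Display for DownloadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}", self)
    }
}

impl Error for DownloadError {}

/// Wraps the typed error so it can pass through an API that only speaks
/// `std::io::Error`, such as a byte stream or a copy loop.
fn into_io_error(e: DownloadError) -> io::Error {
    io::Error::new(io::ErrorKind::Other, e)
}

/// Recovers the typed error if one is stored inside the `io::Error`;
/// only genuinely unknown errors fall back to `Other`.
fn from_io_error(e: &io::Error) -> DownloadError {
    e.get_ref()
        .and_then(|inner| inner.downcast_ref::<DownloadError>())
        .copied()
        .unwrap_or(DownloadError::Other)
}
```

Checking the inner error before falling back keeps `Cancelled` and `Timeout` distinguishable after the round trip, so retry logic can still classify them correctly.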
Development

Successfully merging this pull request may close these issues.

pageserver: more elegant cancellation for remote operations
2 participants