Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: download small objects using a smaller timeout #9938

Merged
merged 2 commits into from
Nov 29, 2024

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Nov 29, 2024

Problem

It appears that the Azure storage API tends to hang TCP connections more than S3 does.

Currently we use a 2 minute timeout for all downloads. This is large because sometimes the objects we download are large. However, waiting 2 minutes when doing something like downloading a manifest on tenant attach is problematic, because when someone is doing a "create tenant, create timeline" workflow, that 2 minutes is long enough for them reasonably to give up creating that timeline.

Rather than propagate oversized timeouts further up the stack, we should use a different timeout for objects that we expect to be small.

Closes: #9836

Summary of changes

  • Add a small_timeout configuration attribute to remote storage, defaulting to 30 seconds (still a very generous period to do something like download an index)
  • Add a DownloadKind parameter to DownloadOpts, so that callers can indicate whether they expect the object to be small or large.
  • In the azure client, use small timeout for HEAD requests, and for GET requests if DownloadKind::Small is used.
  • Use DownloadKind::Small for manifests, indices, and heatmap downloads.

This PR intentionally does not make the equivalent change to the S3 client, to reduce blast radius in case this has unexpected consequences (we could accomplish the same thing by editing lots of configs, but just skipping the code is simpler for right now)

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Nov 29, 2024
@jcsp jcsp marked this pull request as ready for review November 29, 2024 13:46
@jcsp jcsp requested review from a team as code owners November 29, 2024 13:46
@jcsp jcsp requested review from awarus and VladLazar November 29, 2024 13:46
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - some test would be nice just to avoid accidentally breaking this behaviour in the future.

@jcsp jcsp enabled auto-merge November 29, 2024 14:18
@jcsp jcsp added this pull request to the merge queue Nov 29, 2024
Merged via the queue into main with commit d5624cc Nov 29, 2024
84 checks passed
@jcsp jcsp deleted the jcsp/azure-small-timeout branch November 29, 2024 15:12
Copy link

6952 tests run: 6644 passed, 0 failed, 308 skipped (full report)


Flaky tests (3)

Postgres 16

Postgres 14

Code coverage* (full report)

  • functions: 30.3% (8186 of 27044 functions)
  • lines: 47.7% (64876 of 135957 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a86455c at 2024-11-29T15:14:48.347Z :recycle:

awarus pushed a commit that referenced this pull request Dec 5, 2024
## Problem

It appears that the Azure storage API tends to hang TCP connections more
than S3 does.

Currently we use a 2 minute timeout for all downloads. This is large
because sometimes the objects we download are large. However, waiting 2
minutes when doing something like downloading a manifest on tenant
attach is problematic, because when someone is doing a "create tenant,
create timeline" workflow, that 2 minutes is long enough for them
reasonably to give up creating that timeline.

Rather than propagate oversized timeouts further up the stack, we should
use a different timeout for objects that we expect to be small.

Closes: #9836

## Summary of changes

- Add a `small_timeout` configuration attribute to remote storage,
defaulting to 30 seconds (still a very generous period to do something
like download an index)
- Add a DownloadKind parameter to DownloadOpts, so that callers can
indicate whether they expect the object to be small or large.
- In the azure client, use small timeout for HEAD requests, and for GET
requests if DownloadKind::Small is used.
- Use DownloadKind::Small for manifests, indices, and heatmap downloads.

This PR intentionally does not make the equivalent change to the S3
client, to reduce blast radius in case this has unexpected consequences
(we could accomplish the same thing by editing lots of configs, but just
skipping the code is simpler for right now)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remote_storage: Azure client doesn't tolerate server timeouts well
3 participants