Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(per-tenant throttling): exclude throttled time from page_service metrics + regression test #6953

Merged
merged 25 commits into from
Mar 5, 2024

Conversation

problame
Copy link
Contributor

@problame problame commented Feb 28, 2024

part of #5899

Problem

Before this PR, the time spent waiting on the throttle was charged towards the higher-level page_service metrics, i.e., pageserver_smgr_query_seconds.
The metrics are the foundation of internal SLIs / SLOs.
A throttled tenant would cause the SLI to degrade / SLO alerts to fire.

Changes

  • don't charge time spent in throttle towards the page_service metrics
    • record time spent in throttle in RequestContext and subtract it from the elapsed time
    • this works because the page_service path doesn't create child context, so, all the throttle time is recorded in the parent
    • it's quite brittle and will break if we ever decide to spawn child tasks that need child RequestContexts, which would have separate instances of the micros_spent_throttled counter.
    • however, let's punt that to a more general refactoring of RequestContext
  • add a test case that ensures that
    • throttling happens for getpage requests; this aspect of the test passed before this PR
    • throttling delays aren't charged towards the page_service metrics; this aspect of the test only passes with this PR
  • drive-by: make the throttle log message info!, it's an expected condition

Performance

I took the same measurements as in #6706 , no meaningful change in CPU overhead.

Future Work

This PR enables us to experiment with the throttle for select tenants without affecting the SLI metrics / triggering SLO alerts.

Before declaring this feature done, we need more work to happen, specifically:

  • decide on whether we want to retain the flexibility of throttling any Timeline::get call, filtered by TaskKind
  • versus: separate throttles for each page_service endpoint, potentially with separate config options
  • the trouble here is that this decision implies changes to the TenantConfig, so, if we start using the current config style now, then decide to switch to a different config, it'll be a breaking change

Nice-to-haves but probably not worth the time right now:

  • Equivalent tests to ensure the throttle applies to all other page_service handlers.

@problame problame added the run-no-ci Don't run any CI for this PR. label Feb 28, 2024
Copy link

github-actions bot commented Feb 28, 2024

2490 tests run: 2368 passed, 0 failed, 122 skipped (full report)


Code coverage* (full report)

  • functions: 28.8% (6974 of 24232 functions)
  • lines: 47.3% (42710 of 90340 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2dcf5d8 at 2024-03-05T13:05:15.472Z :recycle:

@problame problame force-pushed the problame/throttling-observability-fixes branch from 3860565 to d417cb9 Compare February 28, 2024 19:15
@problame problame removed the run-no-ci Don't run any CI for this PR. label Feb 29, 2024
@problame problame force-pushed the problame/throttling-observability-fixes branch from d417cb9 to bccf459 Compare February 29, 2024 12:12
@problame problame force-pushed the problame/throttling-observability-fixes branch from d0eac41 to a5e0fe5 Compare February 29, 2024 17:43
…ke tests pass, maybe this was root cause of its flakiness
problame added a commit that referenced this pull request Feb 29, 2024
…questContextAdaptor` uses it (#6961)

Extracted from #6953

Part of #5899
problame added a commit that referenced this pull request Mar 1, 2024
…in callers (#6960)

Extracted from #6953

Part of #5899

Core Change
-----------

In #6953, we need the ability to scan the log _after_ a specific line
and ignore anything before that line.

This PR changes `log_contains` to returns a tuple of `(matching line,
cursor)`.
Hand that cursor to a subsequent `log_contains` call to search the log
for the next occurrence of the pattern.

Other Changes
-------------

- Inspect all the callsites of `log_contains` to handle the new tuple
return type.
- Above inspection unveiled many callers aren't using `assert
log_contains(...) is not None` but some weaker version of the code that
breaks if `log_contains` ever returns a not-None but falsy value. Fix
that.
- Above changes unveiled that `test_remote_storage_upload_queue_retries`
was using `wait_until` incorrectly; after fixing the usage, I had to
raise the `wait_until` timeout. So, maybe this will fix its flakiness.
@problame problame requested a review from jcsp March 1, 2024 10:04
@problame problame marked this pull request as ready for review March 1, 2024 10:04
@problame problame requested a review from a team as a code owner March 1, 2024 10:04
pageserver/src/metrics.rs Show resolved Hide resolved
pageserver/src/context.rs Show resolved Hide resolved
@problame problame force-pushed the problame/throttling-observability-fixes branch from 2ee223f to 41d57a9 Compare March 5, 2024 10:48
@problame problame force-pushed the problame/throttling-observability-fixes branch from 41d57a9 to 169065f Compare March 5, 2024 11:17
@problame problame enabled auto-merge (squash) March 5, 2024 12:20
@problame problame merged commit 270d3be into main Mar 5, 2024
53 checks passed
@problame problame deleted the problame/throttling-observability-fixes branch March 5, 2024 13:44
problame added a commit that referenced this pull request Mar 8, 2024
PR #6953 only excluded throttled time from the handle_pagerequests
(aka smgr metrics).

This PR implements the deduction for `basebackup ` queries.

The other page_service methods either don't use Timeline::get
or they aren't used in production.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants