Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consumption_metrics: next steps #5175

Closed
koivunej opened this issue Sep 1, 2023 · 1 comment · Fixed by #5327
Closed

consumption_metrics: next steps #5175

koivunej opened this issue Sep 1, 2023 · 1 comment · Fixed by #5327
Assignees
Labels
c/storage/pageserver Component: storage: pageserver

Comments

@koivunej
Copy link
Member

koivunej commented Sep 1, 2023

Next steps:

  • remove the deduplication possibility, fix tests to expect all metrics to be always sent
  • add more logging
  • fix metric posting retry, align with proxy

PRs:

Rationale for deduplication removal

Before configuration change (private repo PR) we were running with configuration:

metric_collection_interval="X"
cached_metric_collection_interval="X"

This was thought to disable deduplication done in pageserver, but did not. Instead the cached_metric_collection_interval="0s" needs to be used. The deduplication was added originally as specified in #2941 in order to lower the load but since then the requirements have changed, and we are expected to send values even if there was no change.

We cannot just remove the cache, because the incremental metric MetricsKey::written_size_delta needs the previous timestamp it was sent and the previous MetricsKey::written_size.

At minimum we should fix the tests to reflect this new configuration and assert that all metrics are always sent. This is a bit tricky because synthetic_size will not be sent if it is zero.

Rationale for more logging

Currently we don't know anything about how the loop is doing, how long it is taking. We could add something like #5174 so next time it would be easy to rule out executors being very busy and forcing us to miss ticks, or posting just being very slow.

Rationale for retrying

In the logs we see mistakenly judged 502 Bad Gateway responses as "remote refused metrics" which are most likely due to redeploys. Instead we should retry these. It seems proxy is running with similar upload code, so perhaps we should both use something better.

@koivunej koivunej added the c/storage/pageserver Component: storage: pageserver label Sep 1, 2023
@koivunej
Copy link
Member Author

koivunej commented Sep 1, 2023

Now that I think of it, #3485 requested the opposite of deduplication at configurable intervals, so the dedup was always in from the start. From the original epic #2941:

One way to do that is to send info about all tenants on the start, cache values, and then filter out unchanged ones.

So the caching comes from there. Now it seems we no longer need it.

Description changed.

@koivunej koivunej self-assigned this Sep 15, 2023
koivunej added a commit that referenced this issue Sep 15, 2023
We no longer use pageserver deduplication anywhere. Give out a warning
instead.

Split off from #5297.

Cc: #5175 for dedup.
koivunej added a commit that referenced this issue Sep 15, 2023
Split off from #5297. Depends on #5315.
Cc: #5175 for retry
koivunej added a commit that referenced this issue Sep 16, 2023
Write collected metrics to disk to recover previously sent metrics on
restart.

Recover the previously collected metrics during startup, send them over
at right time
  - send cached synthetic size before actual is calculated
  - when `last_record_lsn` rolls back on startup
      - stay at last sent `written_size` metric
      - send `written_size_delta_bytes` metric as 0

Add test support: stateful verification of events in python tests.

Fixes: #5206
Cc: #5175 (loggings, will be enhanced in follow-up)
koivunej added a commit that referenced this issue Sep 18, 2023
Split off from #5297. Builds upon #5326. Handles original review
comments which I did not move to earlier split PRs. Completes test
support for verifying events by notifying of the last batch of events.
Adds cleaning up of tempfiles left because of an unlucky shutdown or
SIGKILL.

Finally closes #5175.

Co-authored-by: Arpad Müller <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver
Projects
None yet
1 participant