consumption_metrics: next steps #5175
Labels
c/storage/pageserver
Component: storage: pageserver
Comments
Now that I think of it, #3485 requested the opposite of deduplication at configurable intervals, so the dedup was always in from the start. From the original epic #2941:
So the caching comes from there. Now it seems we no longer need it. Description changed.
This was referenced Sep 13, 2023
koivunej added a commit that referenced this issue Sep 15, 2023
koivunej added a commit that referenced this issue Sep 15, 2023
koivunej added a commit that referenced this issue Sep 16, 2023
Write collected metrics to disk to recover previously sent metrics on restart.

Recover the previously collected metrics during startup, send them over at right time
- send cached synthetic size before actual is calculated
- when `last_record_lsn` rolls back on startup
  - stay at last sent `written_size` metric
  - send `written_size_delta_bytes` metric as 0

Add test support: stateful verification of events in python tests.

Fixes: #5206
Cc: #5175 (loggings, will be enhanced in follow-up)
koivunej added a commit that referenced this issue Sep 18, 2023
Split off from #5297. Builds upon #5326.

Handles original review comments which I did not move to earlier split PRs. Completes test support for verifying events by notifying of the last batch of events. Adds cleaning up of tempfiles left because of an unlucky shutdown or SIGKILL.

Finally closes #5175.

Co-authored-by: Arpad Müller
Next steps:
PRs:
Rationale for deduplication removal
Before configuration change (private repo PR) we were running with configuration:
This was thought to disable the deduplication done in pageserver, but it did not. Instead, `cached_metric_collection_interval="0s"` needs to be used. The deduplication was originally added as specified in #2941 in order to lower the load, but the requirements have since changed, and we are now expected to send values even if there was no change.
We cannot just remove the cache, because the incremental metric `MetricsKey::written_size_delta` needs the timestamp at which it was previously sent and the previous `MetricsKey::written_size`.
At minimum we should fix the tests to reflect this new configuration and assert that all metrics are always sent. This is a bit tricky because `synthetic_size` will not be sent if it is zero.
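To illustrate why the cache cannot simply be dropped, here is a minimal sketch of how an incremental metric depends on the previously sent value and timestamp. The names `Sent`, `DeltaEvent` and the free function `written_size_delta` are hypothetical stand-ins, not the pageserver's actual types, and clamping a rollback to 0 is an assumption of this sketch.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Hypothetical cache entry: what was last sent for one timeline.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Sent {
    at: SystemTime,
    written_size: u64,
}

/// Hypothetical incremental event covering the span between two sends.
#[derive(Debug, PartialEq)]
struct DeltaEvent {
    start: SystemTime,
    stop: SystemTime,
    bytes: u64,
}

/// Records the current absolute value in the cache and, if a previous
/// value exists, returns the delta since then. A delta of 0 is still
/// returned, so unchanged values are not deduplicated away.
fn written_size_delta(
    cache: &mut HashMap<String, Sent>,
    timeline: &str,
    now: SystemTime,
    written_size: u64,
) -> Option<DeltaEvent> {
    let prev = cache.insert(timeline.to_owned(), Sent { at: now, written_size });
    prev.map(|p| DeltaEvent {
        start: p.at,
        stop: now,
        // Assumption: a value that went backwards is reported as 0.
        bytes: written_size.saturating_sub(p.written_size),
    })
}

fn main() {
    let mut cache = HashMap::new();
    let t0 = SystemTime::UNIX_EPOCH;
    let t1 = t0 + Duration::from_secs(60);
    let t2 = t1 + Duration::from_secs(60);

    // Without a previous entry there is nothing to diff against.
    assert_eq!(written_size_delta(&mut cache, "tl", t0, 100), None);
    // No change: the event is still produced, with 0 bytes.
    let unchanged = written_size_delta(&mut cache, "tl", t1, 100);
    assert_eq!(unchanged, Some(DeltaEvent { start: t0, stop: t1, bytes: 0 }));
    // Growth: the delta spans from the previous send to now.
    let grown = written_size_delta(&mut cache, "tl", t2, 150);
    assert_eq!(grown, Some(DeltaEvent { start: t1, stop: t2, bytes: 50 }));
    println!("delta events computed from cached previous values");
}
```

The point of the sketch is only that the "previous" entry has to live somewhere between iterations, whether or not it is also used for deduplication.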
Rationale for more logging
Currently we know nothing about how the loop is doing or how long each iteration takes. We could add something like #5174, so that next time it would be easy to rule out busy executors forcing us to miss ticks, or the posting simply being very slow.
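A rough sketch of the kind of per-iteration timing that would help here, using only the standard library. The helpers `timed` and `missed_ticks`, the 60 second interval and the log line format are all made up for illustration; the real loop is async and would use the project's own logging.

```rust
use std::time::{Duration, Instant};

/// Runs `f` and returns its result together with how long it took.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let started = Instant::now();
    let out = f();
    (out, started.elapsed())
}

/// Number of whole intervals that fit into the time one iteration took.
/// Anything above 0 means at least one scheduled tick passed while the
/// iteration was still running.
fn missed_ticks(interval: Duration, elapsed: Duration) -> u128 {
    if interval.is_zero() {
        return 0;
    }
    elapsed.as_nanos() / interval.as_nanos()
}

fn main() {
    // Hypothetical collection interval.
    let interval = Duration::from_secs(60);

    // Stand-ins for the two phases we want to tell apart in the logs.
    let (metrics, collect_took) = timed(|| vec![("written_size", 100u64)]);
    let (_, upload_took) = timed(|| { /* stand-in for the HTTP POST */ });

    let total = collect_took + upload_took;
    println!(
        "iteration done: metrics={} collect={:?} upload={:?} missed_ticks={}",
        metrics.len(),
        collect_took,
        upload_took,
        missed_ticks(interval, total)
    );
}
```

Timing collection and upload separately is what lets the log distinguish a slow endpoint from a loop that was not scheduled in time.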
Rationale for retrying
In the logs, 502 Bad Gateway responses, most likely caused by redeploys, are mistakenly reported as "remote refused metrics". We should retry these instead. The proxy seems to run similar upload code, so perhaps both should use something better.
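One possible shape for this, sketched with plain status codes and injected `send`/`sleep` closures so it runs without an HTTP client. `UploadOutcome`, `classify`, `backoff` and `upload_with_retries` are hypothetical names; treating 503 and 504 like 502 and the backoff constants are assumptions, since the issue only mentions 502.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum UploadOutcome {
    Accepted,
    Retry,
    Refused,
}

/// Maps an HTTP status code to what the upload loop should do next.
fn classify(status: u16) -> UploadOutcome {
    match status {
        200..=299 => UploadOutcome::Accepted,
        // 502 is the case seen in the logs; 503 and 504 are assumed
        // to be similarly transient.
        502 | 503 | 504 => UploadOutcome::Retry,
        _ => UploadOutcome::Refused,
    }
}

/// Exponential backoff: base * 2^attempt, capped at `max`.
fn backoff(attempt: u32, base: Duration, max: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}

/// Calls `send` up to `max_attempts` times, sleeping between retryable
/// failures. Returns `Retry` if the last attempt was still retryable.
fn upload_with_retries(
    mut send: impl FnMut() -> u16,
    mut sleep: impl FnMut(Duration),
    max_attempts: u32,
) -> UploadOutcome {
    for attempt in 0..max_attempts {
        match classify(send()) {
            UploadOutcome::Retry if attempt + 1 < max_attempts => {
                sleep(backoff(attempt, Duration::from_millis(100), Duration::from_secs(10)));
            }
            outcome => return outcome,
        }
    }
    UploadOutcome::Retry
}

fn main() {
    // Simulated endpoint: two gateway errors during a redeploy, then success.
    let mut responses = vec![502u16, 502, 200].into_iter();
    let mut slept = Vec::new();
    let outcome = upload_with_retries(
        || responses.next().unwrap_or(500),
        |d| slept.push(d),
        5,
    );
    assert_eq!(outcome, UploadOutcome::Accepted);
    assert_eq!(slept, vec![Duration::from_millis(100), Duration::from_millis(200)]);
    println!("upload accepted after {} retries", slept.len());
}
```

Keeping the classification in one function would also make it easier to share between pageserver and proxy, if they end up using common upload code.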