Consumption metrics faster after restart #4647

Closed
koivunej opened this issue Jul 6, 2023 · 3 comments
Labels
a/consumption_metrics c/storage/pageserver Component: storage: pageserver t/feature Issue type: feature, for new features or requests

Comments

@koivunej
Member

koivunej commented Jul 6, 2023

Right now we start to collect consumption metrics every N minutes, and calculate synthetic size every M minutes for all tenants in sequence (order unspecified). In the default config N == M, so there is a chance that on the first rounds after a restart:

  1. upload may have some synthetic sizes
  2. upload will probably have some synthetic sizes
  3. upload will probably have most synthetic sizes
  4. upload should have most if not all synthetic sizes

However, due to external requirements we now need to report every N minutes rather than at possible multiples of N. There have also been ideas of setting up an SLI for how often we upload the metrics -- I think the interesting case is "how often we upload these per tenant", not "how often a POST request goes out".


On #3542 I noticed that we also leak a bit of metrics for any detached tenants, as we never clear them. The problem with the current approach of listing all tenants and iterating over them is that we lack the ability to clean things up once complete, or to react to, for example, a synthetic size calculation completing.

I think we should instead organize metrics sending so that:

  • a task to buffer and send the collected metrics is spawned
  • each tenant gets a process collecting the metrics on a tick, including synthetic size
    • per tenant caching
    • synthetic size has a global rate limit already
    • synthetic size could be "ongoing" while other metrics are collected, and when it completes, we would just send it like other metrics to be buffered and sent
    • on tenant deactivating, we would exit the task
    • sending cached metrics also on a tick
  • the top-level consumption metrics task
    • spawning up new tasks for appeared tenants
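
The per-tenant part of the list above could be sketched roughly as follows. This is a minimal illustration only, using plain std threads and an `mpsc` channel in place of the pageserver's async tasks; `Metric`, `spawn_tenant_collector`, `drain`, and the `resident_size` name are hypothetical and not taken from the codebase. A synthetic size result that completes later could be sent over the same channel like any other metric.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical metric record; not the pageserver's actual type.
#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub tenant_id: u32,
    pub name: &'static str,
    pub value: u64,
}

/// Spawn one collector for a tenant. On every tick it sends the tenant's
/// metrics to the shared buffering side, and it exits once the tenant is
/// marked inactive, so nothing lingers for detached tenants.
pub fn spawn_tenant_collector(
    tenant_id: u32,
    tick: Duration,
    active: Arc<AtomicBool>,
    tx: Sender<Metric>,
) -> JoinHandle<()> {
    thread::spawn(move || {
        while active.load(Ordering::Relaxed) {
            // Placeholder value; a real collector would read tenant state.
            let metric = Metric { tenant_id, name: "resident_size", value: 0 };
            if tx.send(metric).is_err() {
                break; // the buffering side has gone away
            }
            thread::sleep(tick);
        }
        // `tx` is dropped here, which lets the receiver see that this
        // collector has finished.
    })
}

/// Buffering side: collect everything until all senders are dropped.
pub fn drain(rx: Receiver<Metric>) -> Vec<Metric> {
    rx.into_iter().collect()
}
```

The top-level task would call `spawn_tenant_collector` for each newly appeared tenant and flip the `active` flag when that tenant deactivates.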

I think this design would give us fine-grained control over metrics buffering (send everything at 1min, or at a buffer size, for a more constant load on the receiver), an easy-to-see path to a reqwest_retry based retry mechanism, etc.
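
To make the buffering control concrete, here is a rough sketch of a buffer that hands out a batch either when it reaches a size limit or when its oldest item is too old, whichever comes first. `MetricsBuffer` and its methods are hypothetical names for illustration, and the limits are whatever the caller picks; the retry mechanism is out of scope here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical buffer that flushes on size or on age, whichever comes first.
pub struct MetricsBuffer<T> {
    items: Vec<T>,
    max_items: usize,
    max_age: Duration,
    /// Arrival time of the oldest item currently buffered.
    oldest: Option<Instant>,
}

impl<T> MetricsBuffer<T> {
    pub fn new(max_items: usize, max_age: Duration) -> Self {
        Self { items: Vec::new(), max_items, max_age, oldest: None }
    }

    /// Add one item; returns a batch to POST if the size limit was reached.
    pub fn push(&mut self, item: T, now: Instant) -> Option<Vec<T>> {
        self.oldest.get_or_insert(now);
        self.items.push(item);
        if self.items.len() >= self.max_items {
            Some(self.take())
        } else {
            None
        }
    }

    /// Called on a timer tick; returns a batch if the oldest item is too old.
    pub fn on_tick(&mut self, now: Instant) -> Option<Vec<T>> {
        match self.oldest {
            Some(t) if now.duration_since(t) >= self.max_age => Some(self.take()),
            _ => None,
        }
    }

    /// Empty the buffer and reset the age tracking.
    fn take(&mut self) -> Vec<T> {
        self.oldest = None;
        std::mem::take(&mut self.items)
    }
}
```

Passing `now` in explicitly keeps the policy testable without sleeping.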

The per-tenant tasks would be able to measure if we are keeping up with the goal of collecting all metrics every 5min, and have no leaks.
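
One way a per-tenant task might measure whether it keeps up is to track the gap between consecutive collection rounds and count the ones that exceed the goal. This is a sketch only; `CollectionLag` is a hypothetical helper, and the goal (5 minutes in the text above) is passed in by the caller.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-tenant tracker for how well collection keeps its goal.
pub struct CollectionLag {
    goal: Duration,
    last: Option<Instant>,
    /// Number of rounds that started later than `goal` after the previous one.
    pub late_rounds: u64,
}

impl CollectionLag {
    pub fn new(goal: Duration) -> Self {
        Self { goal, last: None, late_rounds: 0 }
    }

    /// Record that a collection round happened at `now`.
    /// Returns the gap since the previous round, if there was one.
    pub fn record(&mut self, now: Instant) -> Option<Duration> {
        let gap = self.last.map(|prev| now.duration_since(prev));
        if let Some(g) = gap {
            if g > self.goal {
                self.late_rounds += 1;
            }
        }
        self.last = Some(now);
        gap
    }
}
```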

Pre-requisites:

  • an event mechanism to notify when a tenant has been activated
@koivunej koivunej added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver t/tech_design_rfc Issue type: tech design RFC a/consumption_metrics labels Jul 6, 2023
@koivunej
Member Author

koivunej commented Jul 6, 2023

Do we currently have an invariant on consumption metrics collection needing to happen near the same wall clock time, or could it be a per-tenant wall clock time?

As in: after a tenant is created, a clock is started. At N minutes we collect the first metrics and send them over for buffering, and after L minutes of buffering we'd POST them on.

On restart, a lot of clocks would be started at the same time because all local tenants are activated in the range of 2-15s.
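
A tiny sketch of what per-tenant clocks would mean for timing, under the assumption that each tenant's first collection happens exactly N after its own activation. The function names are hypothetical. With activations spread over 2-15s after restart, the first collections end up spread over the same 13s window.

```rust
use std::time::Duration;

/// Offset from restart of each tenant's first collection, assuming each
/// tenant collects `n` after its own activation (offsets are from restart).
pub fn first_collection_offsets(activations: &[Duration], n: Duration) -> Vec<Duration> {
    activations.iter().map(|a| *a + n).collect()
}

/// Width of the window in which the given offsets fall; None if empty.
pub fn spread(offsets: &[Duration]) -> Option<Duration> {
    let min = offsets.iter().min()?;
    let max = offsets.iter().max()?;
    Some(*max - *min)
}
```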

@koivunej koivunej removed the t/tech_design_rfc Issue type: tech design RFC label Jul 17, 2023
@koivunej
Member Author

Removed the accidentally added tech_design_rfc. At least I don't recall what I was thinking when selecting that.

@koivunej
Member Author

koivunej commented Oct 2, 2023

With the best-effort persistent cache added in the work towards #5323, I think the problem has now shifted from getting the metrics faster on startup towards something else, if anything. Closing this for now.

@koivunej koivunej closed this as completed Oct 2, 2023