Consumption metrics faster after restart #4647

koivunej · 2023-07-06T09:22:54Z

Right now we start to collect consumption metrics every N minutes, and calculate synthetic size every M minutes for all tenants in sequence (order unspecified). In the default config N == M, so there is a chance that on the first rounds after the restart:

1. upload may have some synthetic sizes
2. upload will have some synthetic sizes
3. upload will have most synthetic sizes
4. upload should have most if not all synthetic sizes

However, due to external requirements, we now need to report every N minutes instead of possibly at multiples of N. There have also been ideas of setting up an SLI for how often we upload the metrics -- I think the interesting case is "how often we upload these per tenant", not "how often a POST request goes out".

On #3542 I noticed that we also leak a bit of metrics for any detached tenants, as we never clear them. The problem with the current approach of listing all tenants and iterating over them is that we lack the ability to clean things up once complete, or to react to, for example, a synthetic size calculation completing.

I think we should instead organize metrics sending so that:

- a task to buffer and send collected metrics is spawned
- each tenant gets a task collecting the metrics on a tick, including synthetic size
  - per-tenant caching
  - synthetic size has a global rate limit already
  - synthetic size could be "ongoing" while other metrics are collected, and when it completes, we would just send it like other metrics to be buffered and sent
  - on tenant deactivation, we would exit the task
  - sending cached metrics also on a tick
- the top-level consumption metrics task
  - spawns new tasks for tenants that appear
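A minimal sketch of the shape proposed above, assuming plain `std` threads and channels as a stand-in for the pageserver's async tasks. The names `Sample`, `Metric`, `spawn_tenant_task` and `spawn_buffer_task` are hypothetical, and the metric values are placeholders rather than real measurements:

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

/// Hypothetical metric kinds; synthetic size travels like any other metric.
#[derive(Debug, Clone, PartialEq)]
enum Metric {
    WrittenSize(u64),
    SyntheticSize(u64),
}

/// One collected value for one tenant (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Sample {
    tenant: String,
    metric: Metric,
}

/// Per-tenant task: on every tick, collect metrics and hand them to the
/// buffering task. The task exits when the tick channel is closed, which
/// models the tenant being deactivated.
fn spawn_tenant_task(
    tenant: String,
    ticks: Receiver<()>,
    out: Sender<Sample>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut round = 0u64;
        for () in ticks {
            round += 1;
            // Placeholder values; a real task would read tenant state here.
            let _ = out.send(Sample {
                tenant: tenant.clone(),
                metric: Metric::WrittenSize(round * 100),
            });
            let _ = out.send(Sample {
                tenant: tenant.clone(),
                metric: Metric::SyntheticSize(round * 10),
            });
        }
        // Leaving the loop drops `out`, so no per-tenant state lingers.
    })
}

/// Buffering task: gathers samples until every sender is gone, then returns
/// what it buffered. A real task would POST batches instead of returning.
fn spawn_buffer_task(input: Receiver<Sample>) -> thread::JoinHandle<Vec<Sample>> {
    thread::spawn(move || input.iter().collect())
}
```

Because each tenant owns its task and its sender, cleanup on deactivation falls out of dropping the tick channel; nothing has to iterate over all tenants to find stale entries.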

I think this design would allow us fine-grained control over metrics buffering (send everything at 1min, or at buffer size, for a more constant load on the receiver), an easy-to-see path to using the reqwest_retry retry mechanism, etc.
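The "send at 1min, or at buffer size" idea could look roughly like the following. This is a sketch under assumptions: `MetricsBuffer` and its limits are hypothetical names, and time is passed in explicitly so the policy can be exercised without waiting:

```rust
use std::time::{Duration, Instant};

/// Hypothetical buffer that releases a batch when it is full or when its
/// oldest item has waited for `max_age`.
struct MetricsBuffer<T> {
    items: Vec<T>,
    max_items: usize,
    max_age: Duration,
    /// Arrival time of the oldest buffered item, `None` when empty.
    oldest: Option<Instant>,
}

impl<T> MetricsBuffer<T> {
    fn new(max_items: usize, max_age: Duration) -> Self {
        MetricsBuffer { items: Vec::new(), max_items, max_age, oldest: None }
    }

    /// Buffers one item and returns a batch if a limit has been reached.
    fn push(&mut self, item: T, now: Instant) -> Option<Vec<T>> {
        self.oldest.get_or_insert(now);
        self.items.push(item);
        self.flush_if_due(now)
    }

    /// Also meant to be called from a timer tick, so that a slow trickle of
    /// metrics is still sent once the age limit passes.
    fn flush_if_due(&mut self, now: Instant) -> Option<Vec<T>> {
        let oldest = self.oldest?; // empty buffer: nothing to send
        let full = self.items.len() >= self.max_items;
        let stale = now.duration_since(oldest) >= self.max_age;
        if full || stale {
            self.oldest = None;
            Some(std::mem::take(&mut self.items))
        } else {
            None
        }
    }
}
```

Whatever batch comes out would then go to the HTTP client; the retry layer stays outside the buffer, so the flushing policy and the retry policy can be tuned separately.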

The per-tenant tasks would be able to measure if we are keeping up with the goal of collecting all metrics every 5min, and have no leaks.
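As a rough illustration of a per-tenant task measuring whether it keeps up, the sketch below counts collections that completed within the goal interval of the previous one. `CollectionSli` is a hypothetical name, and a real measurement would likely want some tolerance for timer jitter:

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-tenant tracker: was each collection finished within
/// `goal` of the previous one?
struct CollectionSli {
    goal: Duration,
    last_done: Option<Instant>,
    on_time: u64,
    late: u64,
}

impl CollectionSli {
    fn new(goal: Duration) -> Self {
        CollectionSli { goal, last_done: None, on_time: 0, late: 0 }
    }

    /// Call when a collection round for this tenant completes.
    fn record(&mut self, now: Instant) {
        if let Some(prev) = self.last_done {
            if now.duration_since(prev) <= self.goal {
                self.on_time += 1;
            } else {
                self.late += 1;
            }
        }
        // The first round only establishes the baseline.
        self.last_done = Some(now);
    }

    /// Fraction of rounds that met the goal, `None` before the second round.
    fn on_time_ratio(&self) -> Option<f64> {
        let total = self.on_time + self.late;
        if total == 0 {
            None
        } else {
            Some(self.on_time as f64 / total as f64)
        }
    }
}
```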

Prerequisites:

- an event mechanism to notify when a tenant has been activated
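One way such an event mechanism could feed the top-level task, sketched with `std` channels and threads. `TenantEvent` and `run_top_level` are hypothetical names, and the worker body is a placeholder for the per-tenant collection loop:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical lifecycle events emitted by tenant management.
enum TenantEvent {
    Activated(String),
    Deactivated(String),
}

/// Top-level task: keeps exactly one worker per active tenant and returns
/// the ids still active once the event channel closes.
fn run_top_level(events: Receiver<TenantEvent>) -> Vec<String> {
    let mut workers: HashMap<String, (Sender<()>, JoinHandle<()>)> = HashMap::new();
    for event in events {
        match event {
            TenantEvent::Activated(id) => {
                // A repeated activation does not spawn a second worker.
                workers.entry(id).or_insert_with(|| {
                    let (stop_tx, stop_rx) = mpsc::channel::<()>();
                    let handle = thread::spawn(move || {
                        // Placeholder for the collection loop: block until
                        // the stop channel is closed.
                        let _ = stop_rx.recv();
                    });
                    (stop_tx, handle)
                });
            }
            TenantEvent::Deactivated(id) => {
                if let Some((stop_tx, handle)) = workers.remove(&id) {
                    drop(stop_tx); // closing the channel stops the worker
                    let _ = handle.join();
                }
            }
        }
    }
    let mut active: Vec<String> = workers.keys().cloned().collect();
    active.sort();
    // Dropping `workers` closes the remaining stop channels.
    active
}
```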


koivunej · 2023-07-06T09:30:56Z

Do we currently have an invariant on consumption metrics collection needing to happen near the same wall clock time, or could it be a per-tenant wall clock time?

As in: after a tenant is created, a clock is started. At N minutes we collect the first metrics and send them over for buffering, and after L minutes of buffering we'd POST them on.

On restart, a lot of clocks would be started at the same time because all local tenants are activated in the range of 2-15s.
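To make the per-tenant clock arithmetic concrete, here is a small sketch. The function names are hypothetical, and all times are in seconds since pageserver start:

```rust
/// Time at which a tenant's k-th collection (k = 1, 2, ...) would fire,
/// given when the tenant was activated and the collection interval `n`.
fn collection_time(activated_at: u64, n: u64, k: u64) -> u64 {
    activated_at + k * n
}

/// Time at which the metrics from the k-th collection would be POSTed,
/// after `l` seconds of buffering.
fn post_time(activated_at: u64, n: u64, l: u64, k: u64) -> u64 {
    collection_time(activated_at, n, k) + l
}

/// Width of the window into which the k-th collections of all tenants fall.
/// It is independent of k: it equals the spread of the activation times.
fn collection_spread(activations: &[u64], n: u64, k: u64) -> u64 {
    let times = activations.iter().map(|&a| collection_time(a, n, k));
    let max = times.clone().max().unwrap_or(0);
    let min = times.min().unwrap_or(0);
    max - min
}
```

With tenants activated between 2s and 15s after a restart and N = 300s, every round of collections lands inside the same 13s window, so per-tenant clocks alone would not spread the load after a restart.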

koivunej · 2023-07-17T11:36:43Z

Removed the accidentally added tech_design_rfc label. At least I don't recall what I was thinking when selecting that.

koivunej · 2023-10-02T10:35:10Z

With the best-effort persistent cache added in the work towards #5323, I think the problem has now shifted from getting the metrics faster on startup towards something else, if anything. Closing this for now.

koivunej added the t/feature, c/storage/pageserver, t/tech_design_rfc, and a/consumption_metrics labels Jul 6, 2023

koivunej removed the t/tech_design_rfc label Jul 17, 2023

koivunej closed this as completed Oct 2, 2023
