Consumption metrics faster after restart #4647
Labels: a/consumption_metrics, c/storage/pageserver (Component: storage: pageserver), t/feature (Issue type: feature, for new features or requests)
Right now we start collecting consumption metrics every N minutes, and calculate synthetic size every M minutes for all tenants in sequence (order unspecified). In the default config N == M, so there is a chance that on the first rounds after the restart:
However, due to external requirements, we now need to report every N minutes instead of at possibly multiples of N. There have also been ideas of setting up an SLI for how often we upload the metrics -- I think the interesting case is "how often we upload these per tenant", not "how often a POST request goes out".
On #3542 I noticed that we also leak a bit of metrics for any detached tenants, as we never clear them. The problem with the current approach of listing all tenants and iterating over them is that we lack the ability to clean things up once complete, or to react to, for example, synthetic size calculation completing.
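To make the per-tenant SLI and the cleanup problem concrete, here is a minimal sketch of tracking the last upload per tenant. It is not pageserver code: `UploadTracker`, its methods, and the string tenant ids are all hypothetical names chosen for illustration, and the 5 minute interval is only an example value.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker of when each tenant's metrics were last uploaded.
pub struct UploadTracker {
    last_upload: HashMap<String, Instant>,
}

impl UploadTracker {
    pub fn new() -> Self {
        Self { last_upload: HashMap::new() }
    }

    /// Records a successful upload for one tenant.
    pub fn record_upload(&mut self, tenant_id: &str, now: Instant) {
        self.last_upload.insert(tenant_id.to_string(), now);
    }

    /// Drops all state for a detached tenant so nothing leaks.
    /// Returns true if the tenant was being tracked.
    pub fn forget(&mut self, tenant_id: &str) -> bool {
        self.last_upload.remove(tenant_id).is_some()
    }

    /// Lists tenants whose last upload is older than `interval`, sorted by id.
    pub fn overdue(&self, now: Instant, interval: Duration) -> Vec<String> {
        let mut late: Vec<String> = self
            .last_upload
            .iter()
            .filter(|(_, t)| now.duration_since(**t) > interval)
            .map(|(id, _)| id.clone())
            .collect();
        late.sort();
        late
    }

    /// Number of tenants currently tracked.
    pub fn tracked(&self) -> usize {
        self.last_upload.len()
    }
}
```

With something like this, "are we uploading per tenant every N minutes" becomes a question about `overdue` being empty, and detaching a tenant is a single `forget` call.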
I think we should instead organize metrics sending so that:
I think this design would allow us fine-grained control over metrics buffering (send everything at 1min, or at buffer size, for a more constant load on the receiver), an easy-to-see path to using the `reqwest_retry` retry mechanism, etc. The per-tenant tasks would be able to measure if we are keeping up with the goal of collecting all metrics every 5min, and have no leaks.
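The buffering idea ("send everything at 1min, or at buffer size") could look roughly like the sketch below. This only illustrates the flush policy, using the standard library alone; `Metric`, `MetricsBuffer`, and the field names are hypothetical, and the actual sending (and any retry via `reqwest_retry`) is left out.

```rust
use std::time::{Duration, Instant};

/// Hypothetical metric record; the real pageserver types differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub tenant_id: String,
    pub name: &'static str,
    pub value: u64,
}

/// Buffers metrics and decides when a batch is ready to be sent.
pub struct MetricsBuffer {
    items: Vec<Metric>,
    max_items: usize,
    max_age: Duration,
    /// Arrival time of the oldest buffered metric, if any.
    oldest: Option<Instant>,
}

impl MetricsBuffer {
    pub fn new(max_items: usize, max_age: Duration) -> Self {
        Self { items: Vec::new(), max_items, max_age, oldest: None }
    }

    /// Adds a metric; returns a batch once the size limit is reached.
    pub fn push(&mut self, metric: Metric, now: Instant) -> Option<Vec<Metric>> {
        if self.oldest.is_none() {
            self.oldest = Some(now);
        }
        self.items.push(metric);
        if self.items.len() >= self.max_items {
            Some(self.take())
        } else {
            None
        }
    }

    /// Called on a timer; returns a batch if the oldest metric has waited
    /// for at least `max_age`.
    pub fn tick(&mut self, now: Instant) -> Option<Vec<Metric>> {
        match self.oldest {
            Some(t) if now.duration_since(t) >= self.max_age => Some(self.take()),
            _ => None,
        }
    }

    /// Empties the buffer and resets the age clock.
    fn take(&mut self) -> Vec<Metric> {
        self.oldest = None;
        std::mem::take(&mut self.items)
    }
}
```

Per-tenant tasks would call `push` as metrics become available, while a single sender calls `tick` periodically and uploads whatever batch comes back.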
Pre-requisites: