
Inconsistent rate() Function Output between Prometheus and Mimir on Histogram Data #9767

Open
richardmoe opened this issue Oct 29, 2024 · 7 comments

@richardmoe

Describe the bug

We have observed unexpected behavior when using the rate() function on histogram metrics in Mimir compared to Prometheus. Specifically, we sporadically see a significant spike in the Mimir output that is not present in Prometheus.

To Reproduce

  1. Start Prometheus 2.50.1 with remote write to Mimir
  2. Start Mimir 2.13
  3. Run histogram_quantile(0.99, sum by (le,pod) (rate(my_metric_bucket{service="my-service"}[1m])))
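
To compare the two backends without the quantile step, a stripped-down variant of the query above (same metric and labels) can be graphed on both data sources; it shows the per-bucket rates directly, which is where any discrepancy should first appear:

  # per-bucket rates, without histogram_quantile
  sum by (le, pod) (rate(my_metric_bucket{service="my-service"}[1m]))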

Expected behavior

Expected to see the same results in Prometheus and Mimir.

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: Helm
  • 2 Prometheus instances in a Kubernetes cluster with remote write to Mimir

Additional Context

[screenshot]

@colega
Contributor

colega commented Nov 8, 2024

Hello, does this always happen to you on the last sample?

Note that Mimir doesn't offer isolation, because of its distributed nature. When the series for the different buckets of a histogram are written, there's a moment when some of them have been written but others haven't yet. If the query is executed at that specific moment, the histogram_quantile function may see only the higher buckets but not the lower ones, thus inflating the p99 value.

There's no easy fix for this with classic histograms, and we are not planning to fix it, because this issue doesn't exist with native histograms, which are becoming stable now with the release of Prometheus 3.0.
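
For illustration, the native-histogram equivalent of the query in the report would drop the _bucket suffix and the le grouping, assuming the metric is re-instrumented as a native histogram (the name my_metric below is just the classic metric name without the suffix):

  # assumes my_metric is exposed as a native histogram
  histogram_quantile(0.99, sum by (pod) (rate(my_metric{service="my-service"}[1m])))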

Please reopen the issue if you see this happening consistently in samples that were already written "a while ago".

@colega colega closed this as completed Nov 8, 2024
@richardmoe
Author

Hi again, we can also see the issue in metrics written a while ago. Here is an example of a graph over metrics written almost 2 weeks ago:

[screenshot]

@colega
Contributor

colega commented Nov 11, 2024

In this case, I would recommend digging down to a single histogram series and checking what's going on with the buckets.

I would check one of the pods that differs and run an instant query for it in Grafana, such as: rec_api_request_latency_bucket{...}[$__range]. That will show you the raw data stored, and you can check what's going on.
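
For example, with placeholder label values (substitute the real service and pod):

  # raw stored samples for one pod's bucket series over the dashboard range;
  # the service and pod values below are placeholders
  rec_api_request_latency_bucket{service="my-service", pod="my-pod-abc123"}[$__range]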

@colega colega reopened this Nov 11, 2024
@richardmoe
Author

richardmoe commented Nov 13, 2024

The data from an instant query looks pretty similar and I haven't been able to see any big difference there.

[screenshot]

@colega
Contributor

colega commented Nov 14, 2024

You need to switch Format to Time series to render them as graphs, and I'd recommend rendering both data sources on the same graph if you want to compare (use the Mixed data source, then choose a data source per query).

@richardmoe
Author

richardmoe commented Nov 15, 2024

To get a time series graph you need the Range or Both query type.
[screenshot]

[screenshot]

@colega
Contributor

colega commented Nov 15, 2024

You're still rendering the histogram_quantile; that's why you can't render time series from an instant query. Please see my suggestion above:

I would check one of the pods that differs, and query an instant query of that in grafana as: rec_api_request_latency_bucket{...}[$__range]

Something like this:

[screenshot]
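
To compare the two backends directly, the same kind of raw query can also be pinned to a single bucket and issued once per data source in a Mixed-data-source panel, for example (label values are placeholders):

  # run one copy of this query against each data source (Prometheus and Mimir);
  # label values are placeholders
  rec_api_request_latency_bucket{service="my-service", pod="my-pod-abc123", le="0.5"}[$__range]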
