
Add support to honor_timestamps #3

Open
justin-watkinson-sp opened this issue Jun 26, 2019 · 9 comments

@justin-watkinson-sp

Ran into a situation today where it looks like the exporter got "stuck" and reported the same data point for a period of several minutes. Nothing was logged, so it was hard to say if it was due to a service error or anything like that.

I believe adding honor_timestamps support would give Prometheus a more accurate reflection of the data points and prevent a stale value from being reported as a flat line for a period of time.

Not sure if anyone else has run into something similar and has a different cause - I'll let you know if I find something.

@justin-watkinson-sp
Author

I think I've found the root cause of the situation I ran into above.

It looks like the GetMetricData response occasionally does not include data for a requested metric, and as written, the /metrics HTTP handler keeps returning the most recent result (even if stale). It can stay in this state for a while (I've observed 15-20 minutes of a flat line in Prometheus). This is specific to RDS CPU utilization, in case that's helpful...

I see a few options, each with different trade-offs:

  • Honor timestamps - should tell Prometheus that the data hasn't been refreshed, and prevent a stale value from showing up as a flat line for a period of time.
  • Removing anything that was expected but not returned (via a client-side delete), so stale series stop being exposed - see the sketch after this list.
  • Moving the start time further behind now() to increase the chances that the GetMetricData response has at least one element for a particular metric. Possibly combined with honor_timestamps, but it also increases the potential for round trips/pagination/API calls.
  • Setting a default value (e.g. -1), and hoping -1 never has any actual meaning.
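For the second option, something like this could detect which requested query IDs came back with no datapoints (the package layout and function name here are just illustrative, not how the exporter is structured today):

```go
// Illustrative only: after a GetMetricData call, figure out which requested
// query IDs came back with no datapoints, so those series could be deleted
// or skipped instead of re-serving the last value.
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// missingIDs returns the requested query IDs that returned no values.
func missingIDs(requested []string, out *cloudwatch.GetMetricDataOutput) []string {
	seen := make(map[string]bool)
	for _, r := range out.MetricDataResults {
		if len(r.Values) > 0 {
			seen[aws.StringValue(r.Id)] = true
		}
	}
	var missing []string
	for _, id := range requested {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing
}
```

From there the exporter could either drop those series or emit them with their last-known timestamp.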

@frioux
Contributor

frioux commented Jun 28, 2019

I don't think default values could ever be correct, unless we used nil (which I have verified works, though I'm not sure what Prometheus thinks about it).

Honor timestamps seems ok to me; any idea how you do that?

@justin-watkinson-sp
Author

Agreed on the default value; I was just tossing several options out there. Nil, I believe, would just use the zero value in Prometheus (though I didn't try that case).

I did a little testing of honor_timestamps this morning. You have to handle the Metric type directly (or at least I didn't see any way to do it with the CounterVec/GaugeVec types; the bits you need are private).

By writing my own implementation of Collector, I can handle the channel of metrics more directly. The client library provides a method to convert a Metric into a new Metric with a timestamp, and you just push that onto the channel of Metrics.
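The rough shape I was testing looks like this (the metric name, label, and struct fields are placeholders; I believe the client-library method is prometheus.NewMetricWithTimestamp):

```go
// Rough sketch only: a hand-rolled Collector that emits a gauge with an
// explicit CloudWatch timestamp. The metric name, label, and struct fields
// are placeholders, not the exporter's real ones.
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var cpuDesc = prometheus.NewDesc(
	"aws_rds_cpuutilization_average",          // placeholder name
	"Average CPUUtilization from CloudWatch.", // help text
	[]string{"db_instance"},                   // variable labels
	nil,                                       // const labels
)

type cwCollector struct {
	// value and ts would come from the latest GetMetricData response.
	value    float64
	ts       time.Time
	instance string
}

func (c *cwCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- cpuDesc
}

func (c *cwCollector) Collect(ch chan<- prometheus.Metric) {
	m := prometheus.MustNewConstMetric(cpuDesc, prometheus.GaugeValue, c.value, c.instance)
	// Wrap the metric so the sample carries CloudWatch's datapoint time
	// instead of the scrape time.
	ch <- prometheus.NewMetricWithTimestamp(c.ts, m)
}
```

Registering an instance of that with prometheus.MustRegister then exposes the sample with the timestamp attached.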

Prometheus does have limits on accepting a sample too far in the past or future (probably a good thing in this case as well). And if you emit a new value with the exact same timestamp, it will reject the new value, so that's something to consider too.

There's probably a more elegant way that I haven't found yet. I was testing this alongside a case similar to how ListMetrics is used as "background cron metadata", unregistering and re-registering a new collector to ensure old metrics are forgotten.

@frioux
Contributor

frioux commented Oct 30, 2019

BTW I think this might be solved by a PR from @gbegen that I just merged. Can you check and see?

@frioux
Contributor

frioux commented Oct 30, 2019

#5, to be clear

@justin-watkinson-sp
Author

Hey! Thanks for getting back to me. TLDR, been a bit busy with a new little one in my family, so a little less time for fun tech stuff.

Taking a look at the change, I do believe this is a step in the right direction for sure. The ability to know when we're getting the last known metric vs no current value is indeed important.

So the initial issue reported here is indeed resolved. Having learned more about Prometheus over the past few months, though, the "honor_timestamps" part may still be a nice feature request. Since this exporter is essentially a proxy for CloudWatch, emitting the metrics with true CloudWatch timestamps would be ideal. But there still seems to be some concern around the eventual consistency of CloudWatch data (which Brian recently blogged about as well):

https://www.robustperception.io/getting-cloudwatch-data-into-prometheus

I have been assured it's not really 10-minutes bad, but not necessarily 1-minute good either (in practice). I'm curious about your thoughts on this problem. I had to solve a similar problem for ECS and its event stream, where the event's own timestamp is emitted as an honored timestamp.

@frioux
Contributor

frioux commented Oct 31, 2019

Congratulations!

We've definitely run into eventual-consistency issues with CW, but in my experience the data is either there within 1, 5, or 10 minutes, or it never arrives at all. I think support for honor_timestamps would be nice though, so we can definitely leave this open as an idea for others.

@justin-watkinson-sp
Author

So I've written one thing that uses honor_timestamps: it takes the ECS data (desired, running), also consumes the event stream for certain events (e.g. can't find container image), and emits all of those as metrics with timestamps attached. The advantage is that the metric reflects the accurate time of the event rather than the scrape time (the Prometheus default).

However, I couldn't find a way to add the timestamps without skipping the convenience of the counter and gauge vectors and instead implementing the Collector interface and pushing the raw metrics to the channel. Otherwise, the pattern looks very similar to what you've done here in cloudwatching. I think it would work really well for CloudWatch too, since we could emit CloudWatch's own time; it seems like a really good fit for "proxy metrics" like these.
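To make that concrete, a variant of the Collect sketch from earlier could look roughly like this for CloudWatch. descFor is a made-up helper for looking up the right *prometheus.Desc, and this leans on GetMetricData's default newest-first ordering:

```go
// Illustrative only: a collector that holds the latest GetMetricData results
// and emits each one with the timestamp of its newest datapoint, so Prometheus
// records CloudWatch's time rather than the scrape time.
type cloudwatchCollector struct {
	results []*cloudwatch.MetricDataResult // refreshed by the polling loop
}

func (c *cloudwatchCollector) Collect(ch chan<- prometheus.Metric) {
	for _, r := range c.results {
		if len(r.Values) == 0 {
			continue // nothing current for this metric; skip instead of re-serving stale data
		}
		// GetMetricData returns newest-first by default (ScanBy: TimestampDescending).
		v := aws.Float64Value(r.Values[0])
		t := aws.TimeValue(r.Timestamps[0])
		// descFor is hypothetical; assume a Desc with no variable labels for simplicity.
		m := prometheus.MustNewConstMetric(descFor(r), prometheus.GaugeValue, v)
		ch <- prometheus.NewMetricWithTimestamp(t, m)
	}
}
```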

I'd be happy to discuss further to line up vision and see if our needs align. I pictured an ideal exporter emitting metrics for the last N periods (5 min @ 60s in most cases, save for SQS); as long as the values don't change, Prometheus shouldn't care. Or we could emit the most recent datapoint under one labelset and re-emit the 10-minute-plus data under a different labelset for the "fully consistent" view. I'm less confident in that, since it could create a lot of confusion about what a particular metric means...

In any case, I've really enjoyed what you've all been doing here. Works great and as expected. Depending on how this comes out, I'll see if I can dive in and help offer some improvements.

@frioux
Contributor

frioux commented Nov 17, 2019

> I couldn't find a way to add the timestamps without skipping the convenience of the counter and gauge vectors and instead implementing the Collector interface and pushing the raw metrics to the channel.

Yeah, I figured this might be the case. When I initially wrote cloudwatching I assumed I'd be able to simply have a "raw" metric and set all of this, but I couldn't without implementing the Collector interface, which seemed like too much work at the time. A patch doing that would be totally fine, as it would likely make all of our code a little more correct anyway.

> I pictured an ideal exporter emitting metrics for the last N periods (5 min @ 60s in most cases, save for SQS); as long as the values don't change, Prometheus shouldn't care. Or we could emit the most recent datapoint under one labelset and re-emit the 10-minute-plus data under a different labelset for the "fully consistent" view. I'm less confident in that, since it could create a lot of confusion about what a particular metric means...

I am interested in this but not sure about the implications. Let's say, just for simplicity, that we include the timestamps and are surfacing the 5m metrics. Do we end up with gaps between minute zero and minute five? And when would you, as a user, want all of the 1m, 5m, and 10m metrics? I guess you might record all three for a single CW metric as a kind of failsafe in case the 1m and 5m metrics are missing?

In any case I think this could be useful, but we just probably need to make sure it's configurable so we don't amplify costs for features people may not use.
