feat: add counter to track alerts dropped outside of time_intervals #3565
Conversation
@gotjosh this is an attempt to concisely shim in the new counter discussed in #3512, please lemme know your thoughts. Of note, I wonder if the naming of this new metric (…)
Thank you very much for your contribution. The logic overall looks great, but please look at my comments!
notify/notify.go (Outdated)

```diff
@@ -251,6 +251,7 @@ type Metrics struct {
 	numTotalFailedNotifications        *prometheus.CounterVec
 	numNotificationRequestsTotal       *prometheus.CounterVec
 	numNotificationRequestsFailedTotal *prometheus.CounterVec
+	numAlertsSuppressedTotal           prometheus.Counter
```
Apologies for not catching this earlier on, but yes, what we want here is `numNotificationSuppressedTotal`. It's a bit confusing, but we have strict nomenclature at the metric level, and a notification is not the same thing as an alert.

Alerts are what the Alertmanager receives, and as you correctly figured out, we already reflect those with the `suppressed` label on `alertmanager_alerts`.

What we want here is `alertmanager_notifications_suppressed_total`.
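For illustration, a counter following that nomenclature might be declared roughly like this (a minimal sketch using the standard client_golang API; the `Help` text is an assumption, not the merged code):

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// Sketch: a counter named at the notification level; with the namespace
// applied it is exposed as alertmanager_notifications_suppressed_total.
var numNotificationSuppressedTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "alertmanager",
	Name:      "notifications_suppressed_total",
	Help:      "The total number of notifications suppressed.", // assumed help text
})
```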
There's also an argument on whether we should use a label to specify if this was suppressed by an `active_interval` or a `muted_interval`, but let's leave this out for now and see whether this is useful on its own as is.
Ok, I like that distinction much better, I'll move towards this phrasing.

> There's also an argument on whether we should use a label to specify if this was suppressed by an `active_interval` or a `muted_interval`, but let's leave this out for now and see whether this is useful on its own as is.

Agreed, and sounds good to me 👍
addressed in e15292f
notify/notify.go (Outdated)

```diff
@@ -902,6 +909,7 @@ func (tms TimeMuteStage) Exec(ctx context.Context, l log.Logger, alerts ...*type

 	// If the current time is inside a mute time, all alerts are removed from the pipeline.
 	if muted {
+		tms.metrics.numAlertsSuppressedTotal.Add(float64(len(alerts)))
 		level.Debug(l).Log("msg", "Notifications not sent, route is within mute time")
```
Suggested change:

```diff
-		level.Debug(l).Log("msg", "Notifications not sent, route is within mute time")
+		level.Debug(l).Log("msg", "Notifications not sent, route is within mute time", "alerts", len(alerts))
```
Now that we're here, let's try and log the number of alerts we suppressed for traceability.
added in e15292f
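Put together, the stage body after both changes would look roughly like this. A minimal sketch assembled from the hunks above, assuming go-kit logging and this package's `Stage` Exec signature; the placeholder `muted` check and the final returns are illustrative, not the merged code:

```go
package notify

import (
	"context"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"

	"github.com/prometheus/alertmanager/types"
)

// Exec sketch: count and drop all alerts when the route is inside a mute time.
func (tms TimeMuteStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	muted := true // placeholder for the real time-interval lookup

	// If the current time is inside a mute time, all alerts are removed from the pipeline.
	if muted {
		tms.metrics.numAlertsSuppressedTotal.Add(float64(len(alerts)))
		level.Debug(l).Log("msg", "Notifications not sent, route is within mute time", "alerts", len(alerts))
		return ctx, nil, nil
	}
	return ctx, alerts, nil
}
```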
notify/notify.go (Outdated)

```diff
@@ -939,6 +947,7 @@ func (tas TimeActiveStage) Exec(ctx context.Context, l log.Logger, alerts ...*ty

 	// If the current time is not inside an active time, all alerts are removed from the pipeline
 	if !muted {
+		tas.metrics.numAlertsSuppressedTotal.Add(float64(len(alerts)))
 		level.Debug(l).Log("msg", "Notifications not sent, route is not within active time")
```
Suggested change:

```diff
-		level.Debug(l).Log("msg", "Notifications not sent, route is not within active time")
+		level.Debug(l).Log("msg", "Notifications not sent, route is not within active time", "alerts", len(alerts))
```
added in e15292f
@gotjosh ready for round 2 when time permits
Force-pushed from e15292f to 2562373.
Accidentally forgot to sign the last commit. Amended and force-pushed to address.
notify/notify.go (Outdated)

```diff
@@ -284,6 +285,11 @@ func NewMetrics(r prometheus.Registerer, ff featurecontrol.Flagger) *Metrics {
 			Name: "notification_requests_failed_total",
 			Help: "The total number of failed notification requests.",
 		}, labels),
+		numNotificationSuppressedTotal: prometheus.NewCounter(prometheus.CounterOpts{
```
If we decide to follow the other comment, we could add a `reason` label here. Then users can split up the individual rates based on what did the suppression (was it silenced, or part of a mute time? etc.).

It may be worth adding this even if we don't follow my other comment: one reason for active times, and another for mute times, in case users are using both simultaneously. A sketch of what that could look like follows.
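A minimal sketch of the `reason`-labelled variant, using the standard client_golang API; the constant names and `Help` text are illustrative, not the merged code:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical reason values, one per suppression source.
const (
	reasonMuteTimeInterval   = "mute_time_interval"
	reasonActiveTimeInterval = "active_time_interval"
)

// Sketch: the suppression counter as a vector keyed by a "reason" label,
// so users can split rates by what did the suppressing.
var numNotificationSuppressedTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "alertmanager",
	Name:      "notifications_suppressed_total",
	Help:      "The total number of notifications suppressed.",
}, []string{"reason"})
```

A stage would then increment with, for example, `numNotificationSuppressedTotal.WithLabelValues(reasonMuteTimeInterval).Add(float64(len(alerts)))`.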
Like I said in response to the other comment, I can add the metrics to the other stages as well. And agreed, I think a `reason` label is much more obviously valuable now.
notify/notify.go (Outdated)

```diff
-	tas := NewTimeActiveStage(intervener)
-	tms := NewTimeMuteStage(intervener)
+	tas := NewTimeActiveStage(intervener, pb.metrics)
+	tms := NewTimeMuteStage(intervener, pb.metrics)
 	ss := NewMuteStage(silencer)
```
Have we considered adding silenced and inhibited notifications as part of this metric? I.e. we could also pass the metrics into `NewMuteStage` here for silences and on line 390 for inhibitions, and increment based on `muted` in that stage as well.

I ask because we didn't specifically tie the metric to mutes (it's not called `num_notifications_muted_total`), and then users could track suppressions across all the different methods. One possible shape for that is sketched below.
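An illustrative wrapper (not the merged implementation) showing the idea: run any muting stage and count whatever it drops, whether by silence or inhibition. It assumes this package's `Stage` interface and `types.Alert`:

```go
package notify

import (
	"context"

	"github.com/go-kit/log"
	"github.com/prometheus/client_golang/prometheus"

	"github.com/prometheus/alertmanager/types"
)

// countingStage wraps an inner stage and adds the number of alerts that
// stage dropped (silenced, inhibited, ...) to a suppression counter.
type countingStage struct {
	inner   Stage
	counter prometheus.Counter
}

func (s countingStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	ctx, out, err := s.inner.Exec(ctx, l, alerts...)
	if err == nil {
		if dropped := len(alerts) - len(out); dropped > 0 {
			s.counter.Add(float64(dropped))
		}
	}
	return ctx, out, err
}
```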
I'm not opposed to that 👍
Addresses: prometheus#3512

This adds a new counter metric `alertmanager_alerts_supressed_total` that is incremented by `len(alerts)` when an alert is suppressed for being outside of a time_interval, i.e. inside of a mute_time_intervals or outside of an active_time_intervals.

Signed-off-by: TJ Hoplock <[email protected]>
Based on PR feedback: https://github.com/prometheus/alertmanager/pull/3565/files#r1393068026

Signed-off-by: TJ Hoplock <[email protected]>
Force-pushed from 3169b8d to 143c833.
Rebased to handle merge conflict 👍
- Fixed the metric count check to compare the diff between the suppression stage's input and output notifications against the suppression metric; it was previously inverted, comparing against how many notifications the stage suppressed.
- Stopped using `Reset()` to compare collection counts between the multiple stages executed in `TestMuteStageWithSilences()`. The intent was to compare a clean metric collection after each stage execution, but the final stage, where all silences are lifted, results in no metric being created in the test, causing `prom_testutil.ToFloat64()` to panic. Changed to separate vars to check counts between each stage, taking prior counts into account.

Signed-off-by: TJ Hoplock <[email protected]>
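The assertion pattern described above, in rough form (a fragment, assuming it sits in a `notify` package test with `prom_testutil` aliasing `github.com/prometheus/client_golang/prometheus/testutil` and `require` from testify; `stage`, `metrics`, and `logger` stand in for the test's actual fixtures):

```go
// Record the counter before the stage runs instead of calling Reset(),
// so later stages can be checked against their own deltas.
before := prom_testutil.ToFloat64(metrics.numNotificationSuppressedTotal)

ctx, out, err := stage.Exec(ctx, logger, alerts...)
require.NoError(t, err)

// Suppressed = input minus output; the counter must grow by exactly that.
suppressed := float64(len(alerts) - len(out))
require.Equal(t, suppressed, prom_testutil.ToFloat64(metrics.numNotificationSuppressedTotal)-before)
```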
@alexweav does the current change set more resemble what you had in mind? Lemme know your thoughts.
LGTM. Thank you very much for your contribution!
Alertmanager: Update Alertmanager to commit d352d16

This commit updates Alertmanager from f69a508 to d352d16. It has the following changes:

- prometheus/alertmanager#3565
- prometheus/alertmanager#3718
- prometheus/alertmanager#3707
- prometheus/alertmanager#3719
- prometheus/alertmanager#3592
- prometheus/alertmanager#3572
- prometheus/alertmanager#3722
feat: add counter to track alerts dropped outside of time_intervals (prometheus#3565)

* feat: add counter to track alerts dropped outside of time_intervals

  Addresses: prometheus#3512. This adds a new counter metric `alertmanager_alerts_supressed_total` that is incremented by `len(alerts)` when an alert is suppressed for being outside of a time_interval, i.e. inside of a mute_time_intervals or outside of an active_time_intervals.

* test: add time interval suppression metric checks for notify
* test: fix failure message log values in notifier
* ref: address PR feedback for prometheus#3565
* fix: track suppressed notifications metric for inhibit/silence

  Based on PR feedback: https://github.com/prometheus/alertmanager/pull/3565/files#r1393068026

* fix: broken notifier tests

  Fixed the metric count check and stopped using `Reset()` between stages in `TestMuteStageWithSilences()`, as described above.

* rename metric and add constants

Signed-off-by: TJ Hoplock <[email protected]>
Signed-off-by: gotjosh <[email protected]>
Co-authored-by: gotjosh <[email protected]>
Signed-off-by: Gokhan Sari <[email protected]>
* [CHANGE] Deprecate and remove api/v1/ #2970
* [CHANGE] Remove unused feature flags #3676
* [CHANGE] Newlines in smtp password file are now ignored #3681
* [CHANGE] Change compat metrics to counters #3686
* [CHANGE] Do not register compat metrics in amtool #3713
* [CHANGE] Remove metrics from compat package #3714
* [CHANGE] Mark muted alerts #3793
* [FEATURE] Add metric for inhibit rules #3681
* [FEATURE] Support UTF-8 label matchers #3453, #3507, #3523, #3483, #3567, #3568, #3569, #3571, #3595, #3604, #3619, #3658, #3659, #3662, #3668, #3572
* [FEATURE] Add counter to track alerts dropped outside of time_intervals #3565
* [FEATURE] Add date and tz functions to templates #3812
* [FEATURE] Add limits for silences #3852
* [FEATURE] Add time helpers for templates #3863
* [FEATURE] Add auto GOMAXPROCS #3837
* [FEATURE] Add auto GOMEMLIMIT #3895
* [FEATURE] Add Jira receiver integration #3590
* [ENHANCEMENT] Add the receiver name to notification metrics #3045
* [ENHANCEMENT] Add the route ID to uuid #3372
* [ENHANCEMENT] Add duration to the notify success message #3559
* [ENHANCEMENT] Implement webhook_url_file for discord and msteams #3555
* [ENHANCEMENT] Add debug logs for muted alerts #3558
* [ENHANCEMENT] API: Allow the Silences API to use their own 400 response #3610
* [ENHANCEMENT] Add summary to msteams notification #3616
* [ENHANCEMENT] Add context reasons to notifications failed counter #3631
* [ENHANCEMENT] Add optional native histogram support to latency metrics #3737
* [ENHANCEMENT] Enable setting ThreadId for Telegram notifications #3638
* [ENHANCEMENT] Allow webex roomID from template #3801
* [BUGFIX] Add missing integrations to notify metrics #3480
* [BUGFIX] Add missing ttl in pushhover #3474
* [BUGFIX] Fix scheme required for webhook url in amtool #3409
* [BUGFIX] Remove duplicate integration from metrics #3516
* [BUGFIX] Reflect Discord's max length message limits #3597
* [BUGFIX] Fix nil error in warn logs about incompatible matchers #3683
* [BUGFIX] Fix a small number of inconsistencies in compat package logging #3718
* [BUGFIX] Fix log line in featurecontrol #3719
* [BUGFIX] Fix panic in acceptance tests #3592
* [BUGFIX] Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection #3722
* [BUGFIX] Fix crash on errors when url_file is used #3800
* [BUGFIX] Fix race condition in dispatch.go #3826
* [BUGFIX] Fix race conditions in the memory alerts store #3648
* [BUGFIX] Hide config.SecretURL when the URL is incorrect #3887
* [BUGFIX] Fix invalid silence causes incomplete updates #3898
* [BUGFIX] Fix leaking of Silences matcherCache entries #3930
* [BUGFIX] Close SMTP submission correctly to handle errors #4006

Signed-off-by: SuperQ <[email protected]>
Addresses: #3512

This adds a new counter metric `alertmanager_alerts_supressed_total` that is incremented by `len(alerts)` when an alert is suppressed for being outside of a time_interval, i.e. inside of a mute_time_intervals or outside of an active_time_intervals.
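As a rough end-to-end illustration of that behaviour (a standalone sketch; the alert names and the `inMuteInterval` flag are made up, and the counter here only mirrors the real one):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// Illustrative stand-in for alertmanager_alerts_supressed_total.
	suppressed := prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "alertmanager",
		Name:      "alerts_supressed_total",
		Help:      "Total alerts suppressed for being outside of a time_interval.",
	})

	alerts := []string{"HighLatency", "DiskFull", "CPUHot"} // stand-ins for *types.Alert
	inMuteInterval := true                                  // pretend now is inside mute_time_intervals

	if inMuteInterval {
		// The counter is incremented by len(alerts), matching the description above.
		suppressed.Add(float64(len(alerts)))
	}

	fmt.Println(testutil.ToFloat64(suppressed)) // 3
}
```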