Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix zeroed EndsAt timestamp #3805

Conversation

grobinson-grafana
Copy link
Contributor

@grobinson-grafana grobinson-grafana commented Apr 12, 2024

This commit fixes a bug where firing alerts had zeroed EndsAt timestamps in notifications. This bug was introduced in another PR (#1686) to fix issues where alerts are resolved during a flush.

This commit takes a different approach where instead of removing the EndsAt timestamp from the alert, it uses the ResolvedAt and StatusAt methods with the timestamp in the context, not the current time.

{
  "receiver": "default",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "bar": "baz"
      },
      "annotations": {},
      "startsAt": "2024-04-12T14:34:46.324903+02:00",
      "endsAt": "2024-04-12T14:39:46.324903+02:00",
      "generatorURL": "",
      "fingerprint": "6ad8c192a44006c5"
    }
  ],
  "groupLabels": {},
  "commonLabels": {
    "bar": "baz"
  },
  "commonAnnotations": {},
  "externalURL": "http://127.0.0.1:9093",
  "version": "4",
  "groupKey": "{}:{}",
  "truncatedAlerts": 0
}

var (
alerts = ag.alerts.List()
alertsSlice = make(types.AlertSlice, 0, len(alerts))
now = time.Now()
)
for _, alert := range alerts {
a := *alert
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we can remove this, but will do so in another PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the intent correctly this is passing the reference time through the context and every notifier should use the .*At(now time.Time) variant. As we don't (plan) to modify the alert we don't need to take a deep copy (some notifiers call types.Alerts).

Dropping this copy might be hard. On line 528 (529) the UpdatedAt time is compared between the copy and the storage of the aggregation group. Without a copy these numbers will always be the same.

I am wondering if there are ways to make sure that integrations never call the Resolved(), Status()? Given that alertmanager is likely to see an increase in number of integrations that appears desirable to me to ease writing and reviewing these.

Given that the model.Alert is copied and one needs the reference time. What is your thought on creating something like a model.AlertsSnapshot that holds an immutable snapshot of the alert and the status/reference time? This would ease writing and reviewing the notifiers, protect the core from accidental modificiations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping this copy might be hard. On line 528 (529) the UpdatedAt time is compared between the copy and the storage of the aggregation group. Without a copy these numbers will always be the same.

Oh well spotted!

What is your thought on creating something like a model.AlertsSnapshot that holds an immutable snapshot of the alert and the status/reference time?

This is what I tried at first, but I found it to be quite an intrusive change (#3351 (comment)). I haven't ruled it out though - do you think that would be a better way to fix this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think longterm we are better off with a genuine snapshot. It eases writing integrations, it eases code review of them. I think some of the type changes we can probably automate with gofmt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

type AlertSnapshot struct {
	Alert
	// status of the alert at the time it was snapshot.
	status model.AlertStatus
}

func (a *AlertSnapshot) Resolved() bool {
	return a.status == model.AlertResolved
}

func (a *AlertSnapshot) Status() model.AlertStatus {
	return a.status
}

// Need to override String() as otherwise it will call Resolved()
// of model.Alert not AlertSnapshot.
func (a *AlertSnapshot) String() string {
	s := fmt.Sprintf("%s[%s]", a.Name(), a.Fingerprint().String()[:7])
	if a.Resolved() {
		return s + "[resolved]"
	}
	return s + "[active]"
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but from my point of view the trade-off is whether we want all of these fields to be mutable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh – I think I misunderstood. By mutable/immutable are you talking about the fields being package-public (capitalized) or package-private (starting with lowercase)?

Copy link
Contributor

@zecke zecke Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Starting with lower case. I wanted to see how an implementation of this looks like and got the below to compile (and pass most tests but it requires some clean ups).

For me the biggest benefit is that it creates an immutable snapshot and simplifies API. The notification handlers are unable to modify the underlying Alert and the API removes the *At(time.Time) functions and we don't have to remember using notify.Now.

Please see:
main...zecke:alertmanager:zecke-alternative-ends-at

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you have time to open a PR for this? @gotjosh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let me fix the formatting and remove some of the repeated code in the tests.

This commit bumps prometheus/common to v0.52.3. It has a breaking
change where the metric alertmanager_build_info has been renamed
to go_build_info as the metric has been moved from prometheus/common
to prometheus/client_golang and the namspace argument has been
removed.

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana
Copy link
Contributor Author

Build fails as we need #3806.

This commit fixes a bug where firing alerts had zeroed EndsAt
timestamps in notifications. This bug was introduced in another
PR (prometheus#1686) to fix issues where alerts are resolved during a flush.

This commit takes a different approach where instead of removing
the EndsAt timestamp from the alert, it uses the ResolvedAt and
StatusAt methods with the timestamp in the context, not the current
time.

Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
var (
alerts = ag.alerts.List()
alertsSlice = make(types.AlertSlice, 0, len(alerts))
now = time.Now()
)
for _, alert := range alerts {
a := *alert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the intent correctly this is passing the reference time through the context and every notifier should use the .*At(now time.Time) variant. As we don't (plan) to modify the alert we don't need to take a deep copy (some notifiers call types.Alerts).

Dropping this copy might be hard. On line 528 (529) the UpdatedAt time is compared between the copy and the storage of the aggregation group. Without a copy these numbers will always be the same.

I am wondering if there are ways to make sure that integrations never call the Resolved(), Status()? Given that alertmanager is likely to see an increase in number of integrations that appears desirable to me to ease writing and reviewing these.

Given that the model.Alert is copied and one needs the reference time. What is your thought on creating something like a model.AlertsSnapshot that holds an immutable snapshot of the alert and the status/reference time? This would ease writing and reviewing the notifiers, protect the core from accidental modificiations?

@@ -526,7 +525,7 @@ func (ag *aggrGroup) flush(notify func(...*types.Alert) bool) {
level.Error(ag.logger).Log("msg", "failed to get alert", "err", err, "alert", a.String())
continue
}
if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
if a.ResolvedAt(now) && got.UpdatedAt == a.UpdatedAt {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine. If got has any modification reflected in UpdatedAt then we do not remove the alert.

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana grobinson-grafana marked this pull request as ready for review April 17, 2024 10:03
@grobinson-grafana
Copy link
Contributor Author

I'm closing this PR in favor of #3827.

@grobinson-grafana grobinson-grafana deleted the grobinson/fix-ends-at branch June 25, 2024 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants