
[v2] Retry failed updates with exponential backoff #709

Merged: 17 commits into v2 from blampe/backoff on Oct 11, 2024

Conversation

blampe (Contributor) commented on Oct 4, 2024:

Currently, if our Automation API calls fail, they return non-nil errors to the operator. In #676 I modified Update to translate these errors into a "failed" status on the Update/Stack, but other operations (preview etc.) still surface these errors and automatically re-queue.

We'd like to retry these failed updates much less aggressively than we retry transient network errors, for example. To accomplish this we do a few things:

  • We consolidate the update controller's streaming logic for consistent error handling across all operations.
  • We return errors with known gRPC status codes as-is, but unknown status codes are translated into failed results for all operations.
  • We start tracking the number of times a stack has attempted an update. This is used to determine how much exponential backoff to apply.
  • A failed update is considered synced for a cooldown period before we retry it. The cooldown period starts at 5 minutes and doubles for every failed attempt, eventually maxing out at 24 hours.

Fixes #677
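As a rough illustration of the cooldown schedule described above, the backoff amounts to roughly the following. This is a minimal sketch only: the helper name cooldownFor and the use of Go's built-in min are assumptions for illustration, not necessarily how the PR implements it.

package main

import (
	"fmt"
	"math"
	"time"
)

// cooldownFor returns how long a failed update is treated as synced before
// it is retried: the initial backoff doubled per failed attempt, capped at 24h.
func cooldownFor(failures int64) time.Duration {
	cooldown := 5 * time.Minute
	cooldown *= time.Duration(math.Exp2(float64(failures)))
	return min(24*time.Hour, cooldown)
}

func main() {
	for f := int64(0); f <= 6; f++ {
		fmt.Printf("failures=%d -> retry after %s\n", f, cooldownFor(f))
	}
}

With a 5-minute initial backoff the retries land at 5m, 10m, 20m, and so on until the 24-hour cap; a later comment in this thread lowers the initial backoff to 1 minute.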

Comment on lines 698 to 703
((instance.Status.LastUpdate.State == shared.SucceededStackStateMessage &&
(isStackMarkedToBeDeleted ||
(instance.Status.LastUpdate.LastSuccessfulCommit == currentCommit &&
(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))) ||
(!isStackMarkedToBeDeleted &&
instance.Status.LastUpdate.State == shared.FailedStackStateMessage && time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < cooldown))
blampe (Contributor, Author) commented:

This is really hairy! I'll probably break this out and put a table test around it.

EronWright (Contributor) commented on Oct 7, 2024:

To your point, I don't understand why !isStackMarkedToBeDeleted is there, because I feel we should back off in the destroy operation as we do for the update operation.

Maybe some inline funcs to give names to the sub-expressions? Like:

isUpToDate := func() bool {
	return instance.Status.LastUpdate.State == shared.SucceededStackStateMessage &&
		(isStackMarkedToBeDeleted ||
			(instance.Status.LastUpdate.LastSuccessfulCommit == currentCommit &&
				(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))
}
isCoolingDown := func() bool {
	return instance.Status.LastUpdate.State == shared.FailedStackStateMessage &&
		time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < cooldown
}

synced := instance.Status.LastUpdate != nil &&
	instance.Status.LastUpdate.Generation == instance.Generation &&
	(isUpToDate() || isCoolingDown())

blampe (Contributor, Author) replied:

!isStackMarkedToBeDeleted was in here because we had a test that expected finalize to immediately retry a failed deletion, but we should do the same backoff in that case so I fixed the test.

I broke out isSynced into its own function to make it easier to unit test.
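For illustration, a table test over a simplified stand-in for such a helper might look roughly like this. The field names, state strings, and signature below are invented for the sketch; the real isSynced operates on the Stack CRD types and also handles ContinueResyncOnCommitMatch.

package stack

import (
	"testing"
	"time"
)

// lastUpdate is a simplified stand-in for the Stack's status.lastUpdate.
type lastUpdate struct {
	state          string // "succeeded" or "failed"
	successfulSHA  string
	lastResyncTime time.Time
}

// isSynced mirrors the shape of the helper described above: a succeeded
// update is synced while the commit matches and the resync window hasn't
// elapsed; a failed update is synced (i.e. not retried) during its cooldown.
func isSynced(lu *lastUpdate, currentSHA string, deleting bool, resyncFreq, cooldown time.Duration) bool {
	if lu == nil {
		return false
	}
	switch lu.state {
	case "succeeded":
		return deleting ||
			(lu.successfulSHA == currentSHA && time.Since(lu.lastResyncTime) < resyncFreq)
	case "failed":
		return time.Since(lu.lastResyncTime) < cooldown
	default:
		return false
	}
}

func TestIsSynced(t *testing.T) {
	now := time.Now()
	tests := []struct {
		name string
		lu   *lastUpdate
		want bool
	}{
		{"no previous update", nil, false},
		{"succeeded on current commit within resync window",
			&lastUpdate{state: "succeeded", successfulSHA: "abc", lastResyncTime: now}, true},
		{"failed recently, still cooling down",
			&lastUpdate{state: "failed", lastResyncTime: now}, true},
		{"failed long ago, cooldown expired",
			&lastUpdate{state: "failed", lastResyncTime: now.Add(-2 * time.Hour)}, false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := isSynced(tt.lu, "abc", false, time.Hour, time.Hour); got != tt.want {
				t.Errorf("isSynced() = %v, want %v", got, tt.want)
			}
		})
	}
}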


codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 78.17259% with 43 lines in your changes missing coverage. Please review.

Project coverage is 53.68%. Comparing base (83f8438) to head (f15ea11).
Report is 1 commit behind head on v2.

Files with missing lines                                 Patch %   Lines
...ator/internal/controller/auto/update_controller.go    71.11%    36 Missing and 3 partials ⚠️
...tor/internal/controller/pulumi/stack_controller.go    93.22%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v2     #709      +/-   ##
==========================================
+ Coverage   50.22%   53.68%   +3.46%     
==========================================
  Files          27       27              
  Lines        2919     2902      -17     
==========================================
+ Hits         1466     1558      +92     
+ Misses       1272     1164     -108     
+ Partials      181      180       -1     


@blampe added the impact/no-changelog-required label (This issue doesn't require a CHANGELOG update) on Oct 4, 2024
@blampe requested review from EronWright and rquitales on October 4, 2024
Resolved review threads (outdated):
  • operator/api/pulumi/shared/stack_types.go
  • operator/internal/controller/pulumi/stack_controller.go (×6)
@blampe requested a review from EronWright on October 8, 2024
Resolved review threads (outdated):
  • operator/internal/controller/auto/update_controller.go
  • operator/internal/controller/pulumi/stack_controller.go
-(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))
-
-if synced {
+if isSynced(instance, currentCommit, isStackMarkedToBeDeleted) {
 // transition to ready, and requeue reconciliation as necessary to detect
 // branch updates and resyncs.
 instance.Status.MarkReadyCondition()
EronWright (Contributor) commented:

If synced is true due to backoff, we shouldn't mark ourselves as ready, right? Maybe you'd want to check the lastUpdate's status.

blampe (Contributor, Author) replied:

Good point, especially re: the dependencies feature. I changed this to mark as stalled if the update failed.

Comment on lines 840 to 842
if last != nil && last.Generation == current.Generation {
instance.Status.LastUpdate.Failures = last.Failures
}
EronWright (Contributor) commented on Oct 9, 2024:

I would advocate for resetting the failure count to zero here. Once the update is successful, retaining the failure count seems debatable. A couple of weird cases:

  1. In the case of periodic resync, the failure count stays at non-zero in perpetuity.
  2. In the case of sporadic errors (fail, success, fail), one might use backoff prematurely. Or is that intentional?

When a container recovers from a crashloop, does the status still reflect the situation or are "events" the only historical record at that point?

blampe (Contributor, Author) replied on Oct 11, 2024:

> In the case of sporadic errors (fail, success, fail), one might use backoff prematurely. Or is that intentional?

Very good point, this isn't intentional. We should definitely reset here.
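A tiny sketch of the agreed behavior follows. The function and field names here are hypothetical, not the PR's actual code: carry the failure count only across retries of the same generation, and reset it as soon as an update succeeds.

package main

import "fmt"

type lastUpdate struct {
	Generation int64
	Failures   int64
}

// nextFailureCount is a hypothetical helper: success (or a new generation)
// resets the backoff counter; another failure of the same generation bumps it.
func nextFailureCount(last *lastUpdate, currentGeneration int64, succeeded bool) int64 {
	if succeeded || last == nil || last.Generation != currentGeneration {
		return 0
	}
	return last.Failures + 1
}

func main() {
	prev := &lastUpdate{Generation: 3, Failures: 2}
	fmt.Println(nextFailureCount(prev, 3, false)) // 3: same spec failed again
	fmt.Println(nextFailureCount(prev, 3, true))  // 0: success resets the backoff
	fmt.Println(nextFailureCount(prev, 4, false)) // 0: a new generation starts fresh
}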

Comment on lines 922 to 924
cooldown = 5 * time.Minute
cooldown *= time.Duration(math.Exp2(float64(stack.Status.LastUpdate.Failures)))
cooldown = min(24*time.Hour, cooldown)
EronWright (Contributor) commented on Oct 9, 2024:

This seems not aggressive enough to cover short-term transient errors like a network connection error.

I suppose the stack could have spec elements for this.

blampe (Contributor, Author) replied:

Lowered the initial backoff to 1 minute.

Expect(err).NotTo(HaveOccurred())
// 5 minutes * 2^2
Expect(res.RequeueAfter).To(BeNumerically("~", time.Duration(20*time.Minute), time.Minute))
ByMarkingAsReady()
EronWright (Contributor) commented on Oct 9, 2024:

Why would the stack be ready when it is in backoff? Consider, for example, the Stack dependencies feature, where one stack waits for another to be ready. You wouldn't want to prematurely unblock a dependent.

blampe (Contributor, Author) replied:

Changed to reconciling.

@blampe merged commit a4c8810 into v2 on Oct 11, 2024
7 checks passed
@blampe deleted the blampe/backoff branch on October 11, 2024