
[v2] Retry failed updates with exponential backoff #709

Merged: 17 commits into v2 from blampe/backoff on Oct 11, 2024

Conversation

blampe (Contributor) commented on Oct 4, 2024:

Currently, if our Automation API calls fail, they return non-nil errors to the operator. In #676 I modified Update to translate these errors into a "failed" status on the Update/Stack, but other operations (preview etc.) still surface these errors and automatically re-queue.

We'd like to retry these failed updates much less aggressively than we retry transient network errors, for example. To accomplish this we do a few things:

  • We consolidate the update controller's streaming logic for consistent error handling across all operations.
  • We return errors with known gRPC status codes as-is, but unknown status codes are translated into failed results for all operations.
  • We start tracking the number of times a stack has attempted an update. This is used to determine how much exponential backoff to apply.
  • A failed update is considered synced for a cooldown period before we retry it. The cooldown period starts at 5 minutes and doubles for every failed attempt, eventually maxing out at 24 hours.

Fixes #677
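As a rough illustration of the cooldown schedule described above, the backoff amounts to roughly the following. This is a minimal sketch only: the helper name cooldownFor and the use of Go's built-in min are assumptions for illustration, not necessarily how the PR implements it.

package main

import (
	"fmt"
	"math"
	"time"
)

// cooldownFor returns how long a failed update is treated as synced before
// it is retried: the initial backoff doubled per failed attempt, capped at 24h.
func cooldownFor(failures int64) time.Duration {
	cooldown := 5 * time.Minute
	cooldown *= time.Duration(math.Exp2(float64(failures)))
	return min(24*time.Hour, cooldown)
}

func main() {
	for f := int64(0); f <= 6; f++ {
		fmt.Printf("failures=%d -> retry after %s\n", f, cooldownFor(f))
	}
}

With a 5-minute initial backoff the retries land at 5m, 10m, 20m, and so on until the 24-hour cap; a later comment in this thread lowers the initial backoff to 1 minute.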

Comment on lines 698 to 703
((instance.Status.LastUpdate.State == shared.SucceededStackStateMessage &&
(isStackMarkedToBeDeleted ||
(instance.Status.LastUpdate.LastSuccessfulCommit == currentCommit &&
(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))) ||
(!isStackMarkedToBeDeleted &&
instance.Status.LastUpdate.State == shared.FailedStackStateMessage && time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < cooldown))
blampe (Contributor, Author) commented:

This is really hairy! I'll probably break this out and put a table test around it.

EronWright (Contributor) commented on Oct 7, 2024:

To your point, I don't understand why !isStackMarkedToBeDeleted is there, because I feel we should back off in the destroy operation as we do for the update operation.

Maybe some inline funcs to give names to the sub-expressions? Like:

isUpToDate := func() bool {
	return instance.Status.LastUpdate.State == shared.SucceededStackStateMessage &&
		(isStackMarkedToBeDeleted ||
			(instance.Status.LastUpdate.LastSuccessfulCommit == currentCommit &&
				(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))
}
isCoolingDown := func() bool {
	return instance.Status.LastUpdate.State == shared.FailedStackStateMessage &&
		time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < cooldown
}

synced := instance.Status.LastUpdate != nil &&
	instance.Status.LastUpdate.Generation == instance.Generation &&
	(isUpToDate() || isCoolingDown())

blampe (Contributor, Author) replied:

!isStackMarkedToBeDeleted was in here because we had a test that expected finalize to immediately retry a failed deletion, but we should do the same backoff in that case so I fixed the test.

I broke out isSynced into its own function to make it easier to unit test.
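For illustration, a table test over a simplified stand-in for such a helper might look roughly like this. The field names, state strings, and signature below are invented for the sketch; the real isSynced operates on the Stack CRD types and also handles ContinueResyncOnCommitMatch.

package stack

import (
	"testing"
	"time"
)

// lastUpdate is a simplified stand-in for the Stack's status.lastUpdate.
type lastUpdate struct {
	state          string // "succeeded" or "failed"
	successfulSHA  string
	lastResyncTime time.Time
}

// isSynced mirrors the shape of the helper described above: a succeeded
// update is synced while the commit matches and the resync window hasn't
// elapsed; a failed update is synced (i.e. not retried) during its cooldown.
func isSynced(lu *lastUpdate, currentSHA string, deleting bool, resyncFreq, cooldown time.Duration) bool {
	if lu == nil {
		return false
	}
	switch lu.state {
	case "succeeded":
		return deleting ||
			(lu.successfulSHA == currentSHA && time.Since(lu.lastResyncTime) < resyncFreq)
	case "failed":
		return time.Since(lu.lastResyncTime) < cooldown
	default:
		return false
	}
}

func TestIsSynced(t *testing.T) {
	now := time.Now()
	tests := []struct {
		name string
		lu   *lastUpdate
		want bool
	}{
		{"no previous update", nil, false},
		{"succeeded on current commit within resync window",
			&lastUpdate{state: "succeeded", successfulSHA: "abc", lastResyncTime: now}, true},
		{"failed recently, still cooling down",
			&lastUpdate{state: "failed", lastResyncTime: now}, true},
		{"failed long ago, cooldown expired",
			&lastUpdate{state: "failed", lastResyncTime: now.Add(-2 * time.Hour)}, false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := isSynced(tt.lu, "abc", false, time.Hour, time.Hour); got != tt.want {
				t.Errorf("isSynced() = %v, want %v", got, tt.want)
			}
		})
	}
}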


codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 78.17259% with 43 lines in your changes missing coverage. Please review.

Project coverage is 53.68%. Comparing base (83f8438) to head (f15ea11).
Report is 1 commit behind head on v2.

Files with missing lines                                 Patch %   Lines
...ator/internal/controller/auto/update_controller.go    71.11%    36 Missing and 3 partials ⚠️
...tor/internal/controller/pulumi/stack_controller.go    93.22%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v2     #709      +/-   ##
==========================================
+ Coverage   50.22%   53.68%   +3.46%     
==========================================
  Files          27       27              
  Lines        2919     2902      -17     
==========================================
+ Hits         1466     1558      +92     
+ Misses       1272     1164     -108     
+ Partials      181      180       -1     


@blampe added the impact/no-changelog-required label (This issue doesn't require a CHANGELOG update) on Oct 4, 2024
@blampe requested review from EronWright and rquitales on October 4, 2024
Resolved review threads (outdated):
  • operator/api/pulumi/shared/stack_types.go
  • operator/internal/controller/pulumi/stack_controller.go (×6)
@blampe requested a review from EronWright on October 8, 2024
Resolved review threads (outdated):
  • operator/internal/controller/auto/update_controller.go
  • operator/internal/controller/pulumi/stack_controller.go
-(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))
-
-if synced {
+if isSynced(instance, currentCommit, isStackMarkedToBeDeleted) {
 // transition to ready, and requeue reconciliation as necessary to detect
 // branch updates and resyncs.
 instance.Status.MarkReadyCondition()
EronWright (Contributor) commented:

If synced is true due to backoff, we shouldn't mark ourselves as ready, right? Maybe you'd want to check the lastUpdate's status.

blampe (Contributor, Author) replied:

Good point, especially re: the dependencies feature. I changed this to mark as stalled if the update failed.

Comment on lines 840 to 842
if last != nil && last.Generation == current.Generation {
instance.Status.LastUpdate.Failures = last.Failures
}
EronWright (Contributor) commented on Oct 9, 2024:

I would advocate for resetting the failure count to zero here. Once the update is successful, retaining the failure count seems debatable. A couple of weird cases:

  1. In the case of periodic resync, the failure count stays at non-zero in perpetuity.
  2. In the case of sporadic errors (fail, success, fail), one might use backoff prematurely. Or is that intentional?

When a container recovers from a crashloop, does the status still reflect the situation or are "events" the only historical record at that point?

blampe (Contributor, Author) replied on Oct 11, 2024:

> In the case of sporadic errors (fail, success, fail), one might use backoff prematurely. Or is that intentional?

Very good point, this isn't intentional. We should definitely reset here.
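A tiny sketch of the agreed behavior follows. The function and field names here are hypothetical, not the PR's actual code: carry the failure count only across retries of the same generation, and reset it as soon as an update succeeds.

package main

import "fmt"

type lastUpdate struct {
	Generation int64
	Failures   int64
}

// nextFailureCount is a hypothetical helper: success (or a new generation)
// resets the backoff counter; another failure of the same generation bumps it.
func nextFailureCount(last *lastUpdate, currentGeneration int64, succeeded bool) int64 {
	if succeeded || last == nil || last.Generation != currentGeneration {
		return 0
	}
	return last.Failures + 1
}

func main() {
	prev := &lastUpdate{Generation: 3, Failures: 2}
	fmt.Println(nextFailureCount(prev, 3, false)) // 3: same spec failed again
	fmt.Println(nextFailureCount(prev, 3, true))  // 0: success resets the backoff
	fmt.Println(nextFailureCount(prev, 4, false)) // 0: a new generation starts fresh
}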

Comment on lines 922 to 924
cooldown = 5 * time.Minute
cooldown *= time.Duration(math.Exp2(float64(stack.Status.LastUpdate.Failures)))
cooldown = min(24*time.Hour, cooldown)
EronWright (Contributor) commented on Oct 9, 2024:

This seems not aggressive enough to cover short-term transient errors like a network connection error.

I suppose the stack could have spec elements for this.

blampe (Contributor, Author) replied:

Lowered the initial backoff to 1 minute.

Expect(err).NotTo(HaveOccurred())
// 5 minutes * 2^2
Expect(res.RequeueAfter).To(BeNumerically("~", time.Duration(20*time.Minute), time.Minute))
ByMarkingAsReady()
EronWright (Contributor) commented on Oct 9, 2024:

Why would the stack be ready when it is in backoff? Consider, for example, the Stack dependencies feature, where one stack waits for another to be ready. You wouldn't want to prematurely unblock a dependent.

blampe (Contributor, Author) replied:

Changed to reconciling.

@blampe merged commit a4c8810 into v2 on Oct 11, 2024
7 checks passed
@blampe deleted the blampe/backoff branch on October 11, 2024