[v2] Retry failed updates with exponential backoff #709
Conversation
((instance.Status.LastUpdate.State == shared.SucceededStackStateMessage &&
	(isStackMarkedToBeDeleted ||
		(instance.Status.LastUpdate.LastSuccessfulCommit == currentCommit &&
			(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))) ||
	(!isStackMarkedToBeDeleted &&
		instance.Status.LastUpdate.State == shared.FailedStackStateMessage && time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < cooldown))
This is really hairy! I'll probably break this out and put a table test around it.
To your point, I don't understand why `!isStackMarkedToBeDeleted` is there, because I feel we should back off in the destroy operation as we would for the update operation.
Maybe some inline funcs to give names to the sub-expressions? Like:
isUpToDate := func() bool {
	return instance.Status.LastUpdate.State == shared.SucceededStackStateMessage &&
		(isStackMarkedToBeDeleted ||
			(instance.Status.LastUpdate.LastSuccessfulCommit == currentCommit &&
				(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))
}
isCoolingDown := func() bool {
	return instance.Status.LastUpdate.State == shared.FailedStackStateMessage &&
		time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < cooldown
}
synced := instance.Status.LastUpdate != nil &&
	instance.Status.LastUpdate.Generation == instance.Generation &&
	(isUpToDate() || isCoolingDown())
`!isStackMarkedToBeDeleted` was in here because we had a test that expected finalize to immediately retry a failed deletion, but we should do the same backoff in that case, so I fixed the test.
I broke out `isSynced` into its own function to make it easier to unit test.
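For illustration, here is a minimal, self-contained sketch of the kind of table test such a helper enables. The `lastUpdate` type and simplified `isSynced` predicate below are stand-ins, not the operator's actual code; the real helper takes the instance, the current commit, and the deletion flag as shown in the diff above.

```go
package stack_test

import (
	"testing"
	"time"
)

// lastUpdate is a simplified stand-in for the fields the real helper reads
// from instance.Status.LastUpdate.
type lastUpdate struct {
	succeeded      bool
	commit         string
	lastResyncTime time.Time
}

// isSynced is a toy version of the broken-out helper: "synced" means the last
// update succeeded for the current commit, or a failed update is still inside
// its cooldown window.
func isSynced(last *lastUpdate, currentCommit string, cooldown time.Duration, now time.Time) bool {
	if last == nil {
		return false
	}
	if last.succeeded {
		return last.commit == currentCommit
	}
	return now.Sub(last.lastResyncTime) < cooldown
}

func TestIsSynced(t *testing.T) {
	now := time.Now()
	cases := []struct {
		name   string
		last   *lastUpdate
		commit string
		want   bool
	}{
		{"no previous update", nil, "abc", false},
		{"succeeded on current commit", &lastUpdate{succeeded: true, commit: "abc"}, "abc", true},
		{"succeeded on stale commit", &lastUpdate{succeeded: true, commit: "old"}, "abc", false},
		{"failed and still cooling down", &lastUpdate{lastResyncTime: now.Add(-30 * time.Second)}, "abc", true},
		{"failed and cooldown elapsed", &lastUpdate{lastResyncTime: now.Add(-10 * time.Minute)}, "abc", false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := isSynced(tc.last, tc.commit, time.Minute, now); got != tc.want {
				t.Errorf("isSynced() = %v, want %v", got, tc.want)
			}
		})
	}
}
```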
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff             @@
##               v2     #709      +/-   ##
==========================================
+ Coverage   50.22%   53.68%   +3.46%
==========================================
  Files          27       27
  Lines        2919     2902      -17
==========================================
+ Hits         1466     1558      +92
+ Misses       1272     1164     -108
+ Partials      181      180       -1
 			(!sess.stack.ContinueResyncOnCommitMatch || time.Since(instance.Status.LastUpdate.LastResyncTime.Time) < resyncFreq)))

-	if synced {
+	if isSynced(instance, currentCommit, isStackMarkedToBeDeleted) {
 		// transition to ready, and requeue reconciliation as necessary to detect
 		// branch updates and resyncs.
 		instance.Status.MarkReadyCondition()
If synced is true due to backoff, we shouldn't mark ourselves as ready, right? Maybe you'd want to check the lastUpdate's status.
Good point, especially re: the dependencies feature. I changed this to mark as stalled if the update failed.
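A hedged sketch of what that branch might look like: `MarkReadyCondition` appears in the diff above, while `MarkStalledCondition` and its arguments are assumptions about the status API and may differ in the actual change.

```go
// Sketch only: if the stack is "synced" merely because a failed update is
// waiting out its backoff window, surface that instead of claiming readiness,
// so dependent stacks are not unblocked prematurely.
if isSynced(instance, currentCommit, isStackMarkedToBeDeleted) {
	if instance.Status.LastUpdate.State == shared.FailedStackStateMessage {
		// Hypothetical condition helper; the real method name and signature may differ.
		instance.Status.MarkStalledCondition("UpdateFailed", "last update failed; retrying with backoff")
	} else {
		instance.Status.MarkReadyCondition()
	}
}
```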
if last != nil && last.Generation == current.Generation {
	instance.Status.LastUpdate.Failures = last.Failures
}
I would advocate for resetting the failure count to zero here. Once the update is successful, retaining the failure count seems debatable. A couple of weird cases:
- In the case of periodic resync, the failure count stays at non-zero in perpetuity.
- In the case of sporadic errors (fail, success, fail), one might use backoff prematurely. Or is that intentional?
When a container recovers from a crashloop, does the status still reflect the situation or are "events" the only historical record at that point?
> In the case of sporadic errors (fail, success, fail), one might use backoff prematurely. Or is that intentional?
Very good point, this isn't intentional. We should definitely reset here.
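A tiny sketch of the agreed behaviour (illustrative names, not the PR's exact code): the counter only grows across consecutive failures and resets as soon as an update succeeds, so one success clears any earlier failure streak.

```go
// nextFailures models the reset-on-success behaviour discussed above.
func nextFailures(prevFailures int64, updateSucceeded bool) int64 {
	if updateSucceeded {
		return 0 // a successful update clears the backoff counter
	}
	return prevFailures + 1 // consecutive failures keep extending the cooldown
}
```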
cooldown = 5 * time.Minute
cooldown *= time.Duration(math.Exp2(float64(stack.Status.LastUpdate.Failures)))
cooldown = min(24*time.Hour, cooldown)
This doesn't seem aggressive enough to cover short-lived transient errors like a network connection error.
I suppose the stack could have spec elements for this.
Lowered the initial backoff to 1 minute.
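For concreteness, a small self-contained sketch of the backoff with a one-minute base, following the doubling and 24-hour cap from the snippet above (the function name and the overflow clamp are mine): cooldownFor(0) is 1 minute, cooldownFor(2) is 4 minutes, and from roughly 11 consecutive failures onward the 24-hour cap applies.

```go
package stack

import (
	"math"
	"time"
)

// cooldownFor sketches the exponential backoff after lowering the base to
// one minute: 1m * 2^failures, capped at 24h. The early clamp just avoids
// overflowing time.Duration for very large failure counts.
func cooldownFor(failures int64) time.Duration {
	const maxCooldown = 24 * time.Hour
	if failures > 11 { // 1m * 2^11 already exceeds the 24h cap
		return maxCooldown
	}
	cooldown := time.Minute * time.Duration(math.Exp2(float64(failures)))
	return min(maxCooldown, cooldown)
}
```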
Expect(err).NotTo(HaveOccurred())
// 5 minutes * 2^2
Expect(res.RequeueAfter).To(BeNumerically("~", time.Duration(20*time.Minute), time.Minute))
ByMarkingAsReady()
Why would the stack be ready when it is in backoff? Consider, for example, the Stack dependencies feature, where one stack waits for another to be ready. You wouldn't want to prematurely unblock a dependent.
Changed to reconciling.
Currently, if our automation API calls fail, they return non-nil errors to the operator. In #676 I modified `Update` to translate these errors into a "failed" status on the Update/Stack, but other operations (preview etc.) still surface these errors and automatically re-queue.

We'd like to retry these failed updates much less aggressively than we retry transient network errors, for example. To accomplish this we do a few things:
Fixes #677
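As a rough illustration of the approach described above (not the PR's exact code): instead of returning the error and letting controller-runtime's default rate limiter re-queue aggressively, the failure is recorded in status and the reconcile result carries an exponential RequeueAfter, along the lines of the cooldownFor sketch earlier in this thread. The updateErr and cooldownFor names are illustrative.

```go
// Sketch: record the failed update and schedule the retry ourselves.
if updateErr != nil {
	instance.Status.LastUpdate.State = shared.FailedStackStateMessage
	instance.Status.LastUpdate.Failures++
	// Back off exponentially instead of returning the error for an immediate requeue.
	return reconcile.Result{RequeueAfter: cooldownFor(instance.Status.LastUpdate.Failures)}, nil
}
```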