Fix #8132: managed fields failure during Restore #8133
base: main
Conversation
/kind changelog-not-required

I would prefer if this is backported to 1.13 as well.

@mpryc
Codecov Report — Attention: Patch coverage is …

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #8133      +/-   ##
    ==========================================
    + Coverage   58.99%   59.07%   +0.08%
    ==========================================
      Files         364      364
      Lines       30270    30310      +40
    ==========================================
    + Hits        17858    17906      +48
    + Misses     10965    10959       -6
    + Partials    1447     1445       -2

View full report in Codecov by Sentry.
Potential for scope overlap with …
Force-pushed from d4ec8f2 to 097a9f8 (Compare)
Added, dunno how to remove.
I am not convinced about that. #8063 is about retrying on status updates; this fix is about a race condition on a particular object during the restore operation while applying managed fields. It could become a general design rework of the restore operation, but that's a bigger chunk of work. In short, this fix addresses the situation where, after the object is first created in the cluster at:

Line 1503 in f63b714

and before the patch for the managed fields is calculated at:

Line 1672 in f63b714

there are a number of operations on the in-cluster object, including a status update, so the object being patched may no longer be the one that represents the current cluster version. This is of course done to save API calls to the cluster, and we should only retry such an operation when there is a real conflict. I believe this is not really what you are looking into within #8063, as we are explicitly retrying on the object conflict and not on other problems such as an unreachable cluster API: https://pkg.go.dev/k8s.io/client-go/util/retry#RetryOnConflict
@mpryc I removed the changelog-not-required label |
changelogs/unreleased/8132-mpryc (Outdated)

    @@ -0,0 +1 @@
    Random race condition in the restore with managed fields
@mpryc The changelog needs to have the PR number, not the issue number -- s/8132/8133/ in the filename.
Also, I'd rephrase the changelog to describe the fix rather than the bug (since it will appear in release notes):
"Fixed race condition for conflicts on patching managed fields" or something like that.
Changed.
This commit addresses issue vmware-tanzu#8132, where an error randomly appears in the logs during the restore operation.

The error occurs due to a race condition when attempting to patch managed fields on an object that has been modified in the cluster. The error message indicates that the operation cannot be fulfilled because the object has been modified, suggesting that changes should be applied to the latest version.

To resolve this, a retry mechanism has been implemented in the restore process when encountering this error, ensuring that managed fields are properly restored without the error message appearing in the logs.

Signed-off-by: Michal Pryc <[email protected]>
Force-pushed from 097a9f8 to b091d49 (Compare)
@mpryc The patch API should not report the error mentioned in issue #8132; the reason that causes #8132 is that the … So a more reasonable solution would be to make sure the …
See my comments
@ywk253100 Won't this have other possible issues? There is a risk of applying a patch based on an out-of-sync version of the object. This can lead to unintended modifications, conflicts, or even overwriting of data that has been updated since the original …
@mpryc I checked the code again; the relevant part seems to be:

```go
withoutManagedFields := createdObj.DeepCopy()
createdObj.SetManagedFields(obj.GetManagedFields())
patchBytes, err := generatePatch(withoutManagedFields, createdObj)
```

I'm curious why the patch operation reported the conflict error? Only a patch with …
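The point being made here is that the two objects handed to `generatePatch` differ only in their managed fields, so the generated patch should touch nothing else. A toy, self-contained illustration of that (not Velero's actual `generatePatch`, which produces a proper JSON merge patch for unstructured objects; `toyMergePatch` and the sample maps below are hypothetical):

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// toyMergePatch returns JSON containing only the top-level keys whose
// values differ between original and modified -- a simplified stand-in
// for a JSON merge patch generator.
func toyMergePatch(original, modified map[string]any) ([]byte, error) {
	patch := map[string]any{}
	for k, v := range modified {
		if !reflect.DeepEqual(original[k], v) {
			patch[k] = v
		}
	}
	return json.Marshal(patch)
}

func main() {
	// withoutManagedFields mirrors the DeepCopy taken before
	// SetManagedFields; modified carries the backed-up managed fields.
	withoutManagedFields := map[string]any{
		"spec":          map[string]any{"replicas": 1.0},
		"managedFields": nil,
	}
	modified := map[string]any{
		"spec":          map[string]any{"replicas": 1.0},
		"managedFields": []any{map[string]any{"manager": "velero"}},
	}
	p, _ := toyMergePatch(withoutManagedFields, modified)
	// only managedFields appears in the patch; identical spec is omitted
	fmt.Println(string(p))
}
```

Since the diff contains only `managedFields`, the question in the thread is why applying it would ever hit the API server's conflict check at all.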
@ywk253100 Any attempt to patch an object in the cluster, including ones other than … When the patch is applied, it may conflict with the current state of the object, leading to a conflict error. To address this, it's important to implement a retry mechanism. Since it is defined specifically for such scenarios, I believe the correct fix is to retry on conflict, which will ensure only this scenario is taken into consideration: …
Hi @mpryc, if you see the code I pasted, the only difference between the two objects is the …

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/
@ywk253100 I agree with you. While the chances of a conflict error due to the generated patch are minimal, they are still possible, as evidenced by the errors we see in our logs. I can't think of any other scenario that could be causing this specific error. Implementing retries on this patch seems to be a reasonable approach to address the issue. The error occurs at:

Line 1681 in 3408ffe
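The conflict semantics being debated here (a patch carrying a stale `resourceVersion` is rejected by the API server's optimistic concurrency check) can be modeled with a toy sketch. This is a stand-in, not the real API server: `applyPatch`, the sample objects, and the error text are illustrative.

```go
package main

import "fmt"

// applyPatch is a toy stand-in for the API server's patch handling: if the
// patch carries a resourceVersion that does not match the object's current
// one, the request is rejected with a conflict, mirroring Kubernetes'
// optimistic concurrency control. Otherwise the patch is merged in.
func applyPatch(current, patch map[string]any) (map[string]any, error) {
	if rv, ok := patch["resourceVersion"]; ok && rv != current["resourceVersion"] {
		return nil, fmt.Errorf("operation cannot be fulfilled: the object has been modified; please apply your changes to the latest version and try again")
	}
	merged := map[string]any{}
	for k, v := range current {
		merged[k] = v
	}
	for k, v := range patch {
		merged[k] = v
	}
	return merged, nil
}

func main() {
	// the object was updated in the cluster after the restore copy was
	// taken, so its resourceVersion moved on
	current := map[string]any{"resourceVersion": "105", "replicas": 1.0}

	// a patch computed from the stale copy still carries the old version
	// and is rejected with a conflict
	_, err := applyPatch(current, map[string]any{"resourceVersion": "100"})
	fmt.Println("stale patch:", err)

	// a patch that omits resourceVersion merges without a conflict
	merged, err := applyPatch(current, map[string]any{"managedFields": "restored"})
	fmt.Println("clean patch:", merged["managedFields"], err)
}
```

This is why the thread keeps returning to what the generated patch actually contains: if `resourceVersion` ever ends up in the diff, conflicts become possible, and retrying on conflict (against a refreshed object) is the standard remedy.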
|
@mpryc If the … BTW, in which versions of Velero and Kubernetes do you see the issue? Is it reproducible?
@ywk253100 …
@mpryc That's OK for the first step of further debugging. BTW, it seems the … Is there any … Please also check whether the conflict error is reported only for …
@shubham-pampattiwar please take a look for @mpryc when you have a moment.
Thank you for contributing to Velero!
Please add a summary of your change
This commit addresses issue #8132, where an error randomly appears in the logs during the restore operation.
The error occurs due to a race condition when attempting to patch managed fields on an object that has been modified in the cluster. The error message indicates that the operation cannot be fulfilled because the object has been modified, suggesting that changes should be applied to the latest version.
To resolve this, a retry mechanism has been implemented in the restore process when encountering this error, ensuring that managed fields are properly restored without the error message appearing in the logs.
Does your change fix a particular issue?
Fixes #8132
Please indicate you've done the following:
/kind changelog-not-required as a comment on this pull request.
site/content/docs/main