OCPBUGS-32812: When newly built images rolled out, the update progress is not displaying correctly (went 0 --> 3) #4489
Conversation
@djoshy: This pull request references Jira Issue OCPBUGS-32812, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@djoshy: This pull request references Jira Issue OCPBUGS-32812, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Force-pushed from 1c90805 to 2951d96
/test e2e-gcp-op
Force-pushed from cd23b3f to 84889cc
/test e2e-hypershift
Verified using IPI on AWS. We have checked that the updated nodes are correctly reported in the following scenarios:
Nevertheless, we have observed that when applying the OCL image the maxUnavailable value in the MCP is not honored: in pools without any maxUnavailable value (default 1), 2 nodes start applying the OCL image at the same time. We have tried to reproduce it in other clusters without this fix and have only been able to reproduce it once after trying many times. Nonetheless, in the clusters containing this fix this issue is observed basically 100% of the time. Since we were able to reproduce it once without this fix, we cannot be sure that the problem is in this fix, but it seems to be related. Please could you pay special attention to the behaviour we have described here while reviewing this code? If after the review it is decided that this PR has no impact on the behaviour related to maxUnavailable, we can approve this PR and create a different ticket to track the new issue. Thanks.
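For reference on the expectation being tested here: maxUnavailable is meant to cap how many nodes in a pool can be updating at once, so with the default of 1 a second node should not start applying the image until the first has finished. The sketch below is illustrative pseudologic under that assumption; the function and parameter names are hypothetical and this is not the controller's actual code.

package example

// capCandidates illustrates the expected effect of maxUnavailable: with the
// default of 1, at most one node in the pool should be updating at a time.
// Illustrative pseudologic only; names are hypothetical.
func capCandidates(candidates []string, maxUnavailable, currentlyUnavailable int) []string {
	capacity := maxUnavailable - currentlyUnavailable
	if capacity <= 0 {
		// The unavailability budget is already used up; no new nodes may start.
		return nil
	}
	if len(candidates) > capacity {
		candidates = candidates[:capacity]
	}
	return candidates
}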
Force-pushed from 90dc2a9 to bcc4dd9
I have a fix for this in my last push. Layered nodes were incorrectly being marked ready for an update when they lacked the "current image" annotation on the node object (see the snippet quoted below).
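To make the failure mode concrete: if readiness only looks at the MachineConfig annotations, a layered node that has a desired image but no current-image annotation yet can be counted as already updated. Below is a minimal sketch of the corrected comparison; CurrentImageAnnotationKey appears in the snippet quoted later in this review, DesiredImageAnnotationKey is assumed to mirror it, and the helper itself is illustrative rather than the PR's actual code.

package example

import (
	corev1 "k8s.io/api/core/v1"

	daemonconsts "github.com/openshift/machine-config-operator/pkg/daemon/constants"
)

// layeredNodeDone is an illustrative check: a layered node only counts as
// updated when the current-image annotation exists and matches the desired
// image. A node missing the current-image annotation (e.g. freshly opted into
// layering) must not be treated as ready/updated.
func layeredNodeDone(node *corev1.Node) bool {
	dimage, dok := node.Annotations[daemonconsts.DesiredImageAnnotationKey] // assumed constant name
	cimage, cok := node.Annotations[daemonconsts.CurrentImageAnnotationKey]
	if !dok || !cok {
		return false
	}
	return dimage == cimage
}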
It may be worth testing https://issues.redhat.com/browse/OCPBUGS-38869 on this PR as well. I'm not seeing the issue any longer in my testing, so it is possible some of the status rework has fixed it.
Generally makes sense. One question inline.
/hold
To give Sergio a chance to re-verify (plus https://issues.redhat.com/browse/OCPBUGS-38869).
// The MachineConfig annotations are loaded on boot-up by the daemon which
// isn't currently done for the image annotations, so the comparisons here
// are a bit more nuanced.
cimage, cok := node.Annotations[daemonconsts.CurrentImageAnnotationKey]
Can you remind me if any of these can exist but be "empty"? For example, if the pool is opted into layering, and then opted out of layering, do these get cleaned up?
There is no way to opt a pool out of OCL yet, right?
Not at the moment, no. There is a PR here for that piece: #4284
When reverted, the desiredImage and currentImage annotations should be cleared.
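Following up on the empty-versus-missing question above: if a revert clears these annotations to empty strings rather than removing them, the status comparison should treat an empty value like a missing one. A small hedged sketch of that defensive lookup (hypothetical helper, not the controller's code):

package example

// imageAnnotation treats an annotation that is absent or set to "" the same
// way, so a pool opted out of layering (annotations cleared or emptied) is not
// misread as having an image to compare against. Illustrative helper only.
func imageAnnotation(annotations map[string]string, key string) (string, bool) {
	val, ok := annotations[key]
	if !ok || val == "" {
		return "", false
	}
	return val, true
}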
Force-pushed from bcc4dd9 to da754a2
/test all
Verified using IPI on AWS. We have checked that the updated nodes are correctly reported in the following scenarios:
In all cases the right order was used when applying the new configs/images, and the maxUnavailable value was honored (tested with maxUnavailable=1 and maxUnavailable=2). Nevertheless, https://issues.redhat.com/browse/OCPBUGS-38869 is not fixed; we can still see this issue in this PR. MCNs are not updated with the new config:
/label qe-approved
@djoshy: This pull request references Jira Issue OCPBUGS-32812, which is invalid.
/jira refresh
/cherry-pick release-4.16 release-4.17
/unhold
@djoshy: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.
@djoshy: This pull request references Jira Issue OCPBUGS-32812, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: djoshy, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
/test e2e-aws-ovn-upgrade
Merged commit bade313 into openshift:master
@djoshy: Jira Issue OCPBUGS-32812: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-32812 has been moved to the MODIFIED state.
@djoshy: #4489 failed to apply on top of branch "release-4.16":
/cherry-pick release-4.17
@djoshy: new pull request created: #4583
- What I did
- How to verify it
Follow the reproduction steps in the bug; there should not be any odd jumps or inaccuracies in the MCP status numbers for either the layered or non-layered MC/image rollouts. A rough way to watch for such jumps is sketched below.
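The sketch below is an illustration of how one might watch the counters while reproducing, not part of this PR: it assumes kubeconfig access at the default location, uses the client-go dynamic client, and reads the machineconfiguration.openshift.io/v1 MachineConfigPool status fields.

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// Polls the "worker" MachineConfigPool and prints its machine counts so that a
// sudden jump in updatedMachineCount (e.g. 0 -> 3) is easy to spot during a
// rollout. Illustrative verification helper only.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigpools",
	}
	for {
		mcp, err := client.Resource(gvr).Get(context.TODO(), "worker", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		updated, _, _ := unstructured.NestedInt64(mcp.Object, "status", "updatedMachineCount")
		ready, _, _ := unstructured.NestedInt64(mcp.Object, "status", "readyMachineCount")
		total, _, _ := unstructured.NestedInt64(mcp.Object, "status", "machineCount")
		fmt.Printf("%s updated=%d ready=%d total=%d\n",
			time.Now().Format(time.RFC3339), updated, ready, total)
		time.Sleep(10 * time.Second)
	}
}

Watching "oc get machineconfigpool worker -w" alongside shows the same counters from the CLI.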