-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1913536: update.go: add broken symlink check + removal during unit enable #2338
Bug 1913536: update.go: add broken symlink check + removal during unit enable #2338
Conversation
@yuqi-zhang: This pull request references Bugzilla bug 1913536, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
@@ -1432,7 +1432,37 @@ func (dn *Daemon) enableUnits(units []string) error { | |||
args := append([]string{"enable"}, units...) | |||
stdouterr, err := exec.Command("systemctl", args...).CombinedOutput() | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we do a check for RHEL 7?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's another option, I could hard check for RHEL7 and remove broken symlinks that we wrote originally
edit: added a RHEL check
// symlinks during enable. Do a best-effort removal of potentially broken | ||
// hard coded symlinks and try again. | ||
// See: https://bugzilla.redhat.com/show_bug.cgi?id=1913536 | ||
wantsPathSystemd := "/etc/systemd/system/multi-user.target.wants/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The enable code does not assume multi-user.target.wants
; the unit could have a different target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this attempts to fix ONLY multi-user.target.wants
, because that's where we wrote broken symlinks back when we had hardcoding. This ONLY attempts to fix those, and ONLY in the list of enables, and other user error is not handled.
@@ -1432,7 +1432,37 @@ func (dn *Daemon) enableUnits(units []string) error { | |||
args := append([]string{"enable"}, units...) | |||
stdouterr, err := exec.Command("systemctl", args...).CombinedOutput() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The effect of this is that any failure to enable the units will trigger the check. Since you are changing on the on-disk state, you may have to do a systemctl daemon-reload
.
Under the axiom that RHEL 7 can't handle null-target symlinks, I wonder if it would be simpler/safer to:
- check for symlinks in
/etc/systed/system
(since you can't assume the unit is just for multi-user.target.wants) - check if the target is dangling, remove it
- then systemctl-enable
- guard for RHEL 7 only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The effect of this is that any failure to enable the units will trigger the check. Since you are changing on the on-disk state, you may have to do a systemctl daemon-reload.
Right, and I propagate the old error just in case. Put another way I am just performing one extra "cleanup" when I fail the first time, so normal operation specifically doesn't hit this.
check for symlinks in /etc/systed/system (since you can't assume the unit is just for multi-user.target.wants)
I don't want to do that because then we'd be potentially managing items not defined in a machineconfig or correct user error which I do not want to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, if user error occurs, I still want to error as usual. I don't want this to actually fix incorrect unit definitions, for example, or if a user wrote a bad link with a files
section config. I am only attempting to clean up bad links from our own code pre 4.7. I can try to make that clearer in the commit message if you think the approach makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be ok if the logic was the same, but I only attempted this on RHEL7? Or if I parse the error specifically for "file exists" which is what the rhel 7 bug was? (somewhat hacky I feel)
/retest |
Added a check for rhel |
/retest |
If unit enable fails, remove broken symlinks in multi-user.target.wants and try again. This fixes a bug where enables would fail on cluster upgrades with RHEL 7 nodes between 4.6 -> 4.7. Context: before openshift#2145, the MCO hard coded a symlink from /etc/system/systemd/$UNIT to /etc/systemd/system/multi-user.target.wants/$UNIT, which is not the case for every unit and thus caused broken symlinks. On RHCOS/FCOS, the systemd version is newer and is able to remove broken symlinks, but on RHEL 7 nodes, it will not first attempt to remove broken symlinks and thus fails the enable. As a workaround, this PR thus attempts to remove broken symlinks when the first enable fails, and then try again. Successful FCOS/RHCOS upgrades should not hit this, and failing ones would report full errors. The error checking is perhaps a bit overkill but the original bug case should only run through this logic once before it is fixed. Future errors are likely actual errors and will be reported as such. Signed-off-by: Yu Qi Zhang <[email protected]>
b4b84cb
to
8e90105
Compare
Rebased, all tests passed, tested locally with 4.6.6->4.6.11-> 4.7 nightly with this PR + rhel node, all errors were automatically fixed, upgrade was successful, and no dangling symlinks remain. This should be good to go (assuming we're ok with the above assumptions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM.
Will leave final approval on Ben.
/retest |
/approve After talking to Jerry offline, looks good to me. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: darkmuggle, sinnykumari, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
3 similar comments
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
@yuqi-zhang: All pull requests linked via external trackers have merged: Bugzilla bug 1913536 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
If unit enable fails, remove broken symlinks in multi-user.target.wants
and try again. This fixes a bug where enables would fail on cluster upgrades
with RHEL 7 nodes between 4.6 -> 4.7.
Context: before #2145,
the MCO hard coded a symlink from /etc/system/systemd/$UNIT to
/etc/systemd/system/multi-user.target.wants/$UNIT, which is not the case
for every unit and thus caused broken symlinks. On RHCOS/FCOS, the systemd
version is newer and is able to remove broken symlinks, but on RHEL 7 nodes,
it will not first attempt to remove broken symlinks and thus fails the
enable. As a workaround, this PR thus attempts to remove broken symlinks
when the first enable fails, and then try again. Successful FCOS/RHCOS upgrades
should not hit this, and failing ones would report full errors.
The error checking is perhaps a bit overkill but the original bug case should
only run through this logic once before it is fixed. Future errors are likely
actual errors and will be reported as such.
Signed-off-by: Yu Qi Zhang [email protected]