Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1913536: update.go: add broken symlink check + removal during unit enable #2338

Merged

Conversation

yuqi-zhang
Copy link
Contributor

If unit enable fails, remove broken symlinks in multi-user.target.wants
and try again. This fixes a bug where enables would fail on cluster upgrades
with RHEL 7 nodes between 4.6 -> 4.7.

Context: before #2145,
the MCO hard coded a symlink from /etc/system/systemd/$UNIT to
/etc/systemd/system/multi-user.target.wants/$UNIT, which is not the case
for every unit and thus caused broken symlinks. On RHCOS/FCOS, the systemd
version is newer and is able to remove broken symlinks, but on RHEL 7 nodes,
it will not first attempt to remove broken symlinks and thus fails the
enable. As a workaround, this PR thus attempts to remove broken symlinks
when the first enable fails, and then try again. Successful FCOS/RHCOS upgrades
should not hit this, and failing ones would report full errors.

The error checking is perhaps a bit overkill but the original bug case should
only run through this logic once before it is fixed. Future errors are likely
actual errors and will be reported as such.

Signed-off-by: Yu Qi Zhang [email protected]

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Jan 13, 2021
@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: This pull request references Bugzilla bug 1913536, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1913536: update.go: add broken symlink check + removal during unit enable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 13, 2021
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2021
@yuqi-zhang
Copy link
Contributor Author

/retest

@@ -1432,7 +1432,37 @@ func (dn *Daemon) enableUnits(units []string) error {
args := append([]string{"enable"}, units...)
stdouterr, err := exec.Command("systemctl", args...).CombinedOutput()
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we do a check for RHEL 7?

Copy link
Contributor Author

@yuqi-zhang yuqi-zhang Jan 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's another option, I could hard check for RHEL7 and remove broken symlinks that we wrote originally
edit: added a RHEL check

// symlinks during enable. Do a best-effort removal of potentially broken
// hard coded symlinks and try again.
// See: https://bugzilla.redhat.com/show_bug.cgi?id=1913536
wantsPathSystemd := "/etc/systemd/system/multi-user.target.wants/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The enable code does not assume multi-user.target.wants; the unit could have a different target.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this attempts to fix ONLY multi-user.target.wants, because that's where we wrote broken symlinks back when we had hardcoding. This ONLY attempts to fix those, and ONLY in the list of enables, and other user error is not handled.

@@ -1432,7 +1432,37 @@ func (dn *Daemon) enableUnits(units []string) error {
args := append([]string{"enable"}, units...)
stdouterr, err := exec.Command("systemctl", args...).CombinedOutput()
Copy link
Contributor

@darkmuggle darkmuggle Jan 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The effect of this is that any failure to enable the units will trigger the check. Since you are changing on the on-disk state, you may have to do a systemctl daemon-reload.

Under the axiom that RHEL 7 can't handle null-target symlinks, I wonder if it would be simpler/safer to:

  • check for symlinks in /etc/systed/system (since you can't assume the unit is just for multi-user.target.wants)
  • check if the target is dangling, remove it
  • then systemctl-enable
  • guard for RHEL 7 only

Copy link
Contributor Author

@yuqi-zhang yuqi-zhang Jan 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The effect of this is that any failure to enable the units will trigger the check. Since you are changing on the on-disk state, you may have to do a systemctl daemon-reload.

Right, and I propagate the old error just in case. Put another way I am just performing one extra "cleanup" when I fail the first time, so normal operation specifically doesn't hit this.

check for symlinks in /etc/systed/system (since you can't assume the unit is just for multi-user.target.wants)

I don't want to do that because then we'd be potentially managing items not defined in a machineconfig or correct user error which I do not want to happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, if user error occurs, I still want to error as usual. I don't want this to actually fix incorrect unit definitions, for example, or if a user wrote a bad link with a files section config. I am only attempting to clean up bad links from our own code pre 4.7. I can try to make that clearer in the commit message if you think the approach makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be ok if the logic was the same, but I only attempted this on RHEL7? Or if I parse the error specifically for "file exists" which is what the rhel 7 bug was? (somewhat hacky I feel)

@yuqi-zhang
Copy link
Contributor Author

/retest

@yuqi-zhang
Copy link
Contributor Author

Added a check for rhel

@yuqi-zhang
Copy link
Contributor Author

/retest

If unit enable fails, remove broken symlinks in multi-user.target.wants
and try again. This fixes a bug where enables would fail on cluster upgrades
with RHEL 7 nodes between 4.6 -> 4.7.

Context: before openshift#2145,
the MCO hard coded a symlink from /etc/system/systemd/$UNIT to
/etc/systemd/system/multi-user.target.wants/$UNIT, which is not the case
for every unit and thus caused broken symlinks. On RHCOS/FCOS, the systemd
version is newer and is able to remove broken symlinks, but on RHEL 7 nodes,
it will not first attempt to remove broken symlinks and thus fails the
enable. As a workaround, this PR thus attempts to remove broken symlinks
when the first enable fails, and then try again. Successful FCOS/RHCOS upgrades
should not hit this, and failing ones would report full errors.

The error checking is perhaps a bit overkill but the original bug case should
only run through this logic once before it is fixed. Future errors are likely
actual errors and will be reported as such.

Signed-off-by: Yu Qi Zhang <[email protected]>
@yuqi-zhang yuqi-zhang force-pushed the fix-systemctl-enable-rhel7 branch from b4b84cb to 8e90105 Compare January 15, 2021 22:31
@yuqi-zhang
Copy link
Contributor Author

Rebased, all tests passed, tested locally with 4.6.6->4.6.11-> 4.7 nightly with this PR + rhel node, all errors were automatically fixed, upgrade was successful, and no dangling symlinks remain. This should be good to go (assuming we're ok with the above assumptions)

Copy link
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.
Will leave final approval on Ben.

@sinnykumari
Copy link
Contributor

/retest

@darkmuggle
Copy link
Contributor

/approve
/lgtm

After talking to Jerry offline, looks good to me.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2021
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: darkmuggle, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [darkmuggle,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 700dfc8 into openshift:master Jan 26, 2021
@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: All pull requests linked via external trackers have merged:

Bugzilla bug 1913536 has been moved to the MODIFIED state.

In response to this:

Bug 1913536: update.go: add broken symlink check + removal during unit enable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants