Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1885365: daemon: properly handle unit enable/disables #2145

Conversation

yuqi-zhang
Copy link
Contributor

Remove the hardlink to multi-user.target and instead invoke systemctl
to directly enable/disable units as needed, so [Install] - wantedby
sections are properly parsed.

Also call systemctl preset on units that are deleted/changed, to
ensure "rollback" of unit enablement to system defaults. Remove
the Ignition-written preset file to account for this.

@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: This pull request references Bugzilla bug 1885365, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1885365: daemon: properly handle unit enable/disables

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 7, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2020
@yuqi-zhang yuqi-zhang requested a review from bgilbert October 7, 2020 23:14
@yuqi-zhang
Copy link
Contributor Author

need to fix up a bit more but general idea is up for review

Copy link
Contributor

@bgilbert bgilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally.

pkg/daemon/daemon.go Outdated Show resolved Hide resolved
@yuqi-zhang yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch 4 times, most recently from 8ca3f00 to e27e04e Compare October 8, 2020 19:38
@yuqi-zhang
Copy link
Contributor Author

yuqi-zhang commented Oct 8, 2020

Tested with the following:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 98-test-service
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: test.service
          contents: |
            [Unit]
            Description=Hello Test Service
            After=network-online.target
            [Service]
            Type=simple
            ExecStart=/usr/bin/echo hello
            [Install]
            WantedBy=network-online.target

and

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 98-test-service-2
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: test.service
          enabled: true

Adding the first correctly adds the service
Adding the second correctly enables and runs the service upon reboot, linking to network-online.service
Removing the second correctly disables the service

Also tested without this PR, the above would link to multi-user.target.wants instead (incorrect) and does not get disabled when you remove the second MC (incorrect)

@yuqi-zhang
Copy link
Contributor Author

/skip

@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test cluster-bootimages
  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-proxy
  • /test e2e-aws-workers-rhel7
  • /test e2e-azure
  • /test e2e-gcp-op
  • /test e2e-metal-ipi
  • /test e2e-openstack
  • /test e2e-ovirt
  • /test e2e-ovn-step-registry
  • /test e2e-upgrade
  • /test e2e-vsphere
  • /test e2e-vsphere-upi
  • /test images
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-images
  • /test unit
  • /test verify

Use /test all to run the following jobs:

  • pull-ci-openshift-machine-config-operator-master-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
  • pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
  • pull-ci-openshift-machine-config-operator-master-e2e-upgrade
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ci/prow/e2e-gcp-op

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuqi-zhang
Copy link
Contributor Author

/test e2e-gcp-op

pkg/daemon/update.go Outdated Show resolved Hide resolved
@yuqi-zhang
Copy link
Contributor Author

So there's a small issue with this approach: there appears to be roughly a ~35% increase in operation time due to the extra writes and calls to systemctl. At the very least this times out our GCP-OP tests.

Our options then seem to be:

  1. incur the performance penalty and bump test timeouts
  2. go back to writing the files by hand, and add a parser ourselves to find the install section of the unit, when its enabled, and link based on that instead of hardcoding links

@cgwalters
Copy link
Member

Today the MCO rewrites files even if they didn't change; if we fixed that it'd be a general improvement and also fix that bug I think.

@yuqi-zhang yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch from e27e04e to b22cbf3 Compare October 9, 2020 14:36
@yuqi-zhang
Copy link
Contributor Author

Today the MCO rewrites files even if they didn't change; if we fixed that it'd be a general improvement and also fix that bug I think.

I was under the impression that this was somewhat intentional. Although inefficient, this has some benefits e.g. allowing us to use the forcefile option to re-write all the configs, in case of a drift-degraded situation.

We could consider that separately. I also don't see how it would fix this issue, as the fundamental problem is a wrong hardcode + no restoration of system defaults

@bgilbert
Copy link
Contributor

bgilbert commented Oct 9, 2020

Hand-implementing [Install] would require handling or intentionally ignoring various corner cases: Alias, Also, and DefaultInstance for templated units. I wouldn't recommend it.

systemctl {enable,disable,preset} support specifying multiple units on the command line. You could accumulate three lists of units to configure, and then at the end, run systemctl exactly once per category of change.

pkg/daemon/update.go Outdated Show resolved Hide resolved
pkg/daemon/update.go Show resolved Hide resolved
@yuqi-zhang yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch 2 times, most recently from 6b526da to 78b0fd3 Compare October 13, 2020 22:06
Copy link
Contributor

@bgilbert bgilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dcbw
Copy link
Contributor

dcbw commented Dec 2, 2020

/retest

Remove the hardlink to multi-user.target and instead invoke systemctl
to directly enable/disable units as needed, so [Install] - wantedby
sections are properly parsed.

Also call systemctl preset on units that are deleted/changed, to
ensure "rollback" of unit enablement to system defaults. Remove
the Ignition-written preset file to account for this.

Signed-off-by: Yu Qi Zhang <[email protected]>
With the introduction of shelling out to systemctl, the MCO e2e
tests incur a ~35% penalty to speed. Try to mitigate this by
collecting all enabled/disabled units and writing them in one call.

Don't do this for presets, since we don't want to fail for any
individual preset command failing.

Signed-off-by: Yu Qi Zhang <[email protected]>
@yuqi-zhang yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch from e16110a to b9e06c3 Compare December 8, 2020 22:22
@yuqi-zhang
Copy link
Contributor Author

Rebased, will see whether slowdowns are still seen in gcp-op tests

@sinnykumari
Copy link
Contributor

/retest

@yuqi-zhang
Copy link
Contributor Author

Comparing runtimes of most recent gcp-op tests from other PRs, I don't see a significant difference in runtime for the tests that passed. Will retest a few more times after gcp-op issues are fixed.

@yuqi-zhang
Copy link
Contributor Author

Tested again with the steps in #2145 (comment). Once we're confident in GCP-OP not being slowed down by this, I'm ready to have this merge

@yuqi-zhang
Copy link
Contributor Author

/retest

4 similar comments
@yuqi-zhang
Copy link
Contributor Author

/retest

@yuqi-zhang
Copy link
Contributor Author

/retest

@yuqi-zhang
Copy link
Contributor Author

/retest

@yuqi-zhang
Copy link
Contributor Author

/retest

@darkmuggle
Copy link
Contributor

/approve
/lgtm

Reading over the PR and the code, this LGTM.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bgilbert, darkmuggle, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuqi-zhang
Copy link
Contributor Author

yuqi-zhang commented Dec 14, 2020

One more test to be safe. I seem some minor slowdowns but I think they are unrelated to this PR. Overall seems to be matching expectations

/test ci/prow/e2e-gcp-op

@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test cluster-bootimages
  • /test e2e-agnostic-upgrade
  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-workers-rhel7
  • /test e2e-azure
  • /test e2e-gcp-op
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-openstack
  • /test e2e-ovirt
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upi
  • /test images
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-images
  • /test unit
  • /test verify

Use /test all to run the following jobs:

  • pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-serial
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
  • pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

One more test to be safe

/test ci/prow/e2e-gcp-op

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuqi-zhang
Copy link
Contributor Author

/test e2e-gcp-op

@yuqi-zhang
Copy link
Contributor Author

Times look fine
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 7b683e3 into openshift:master Dec 15, 2020
@openshift-ci-robot
Copy link
Contributor

@yuqi-zhang: All pull requests linked via external trackers have merged:

Bugzilla bug 1885365 has been moved to the MODIFIED state.

In response to this:

Bug 1885365: daemon: properly handle unit enable/disables

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Jan 15, 2021
If unit enable fails, remove broken symlinks in multi-user.target.wants
and try again. This fixes a bug where enables would fail on cluster upgrades
with RHEL 7 nodes between 4.6 -> 4.7.

Context: before openshift#2145,
the MCO hard coded a symlink from /etc/system/systemd/$UNIT to
/etc/systemd/system/multi-user.target.wants/$UNIT, which is not the case
for every unit and thus caused broken symlinks. On RHCOS/FCOS, the systemd
version is newer and is able to remove broken symlinks, but on RHEL 7 nodes,
it will not first attempt to remove broken symlinks and thus fails the
enable. As a workaround, this PR thus attempts to remove broken symlinks
when the first enable fails, and then try again. Successful FCOS/RHCOS upgrades
should not hit this, and failing ones would report full errors.

The error checking is perhaps a bit overkill but the original bug case should
only run through this logic once before it is fixed. Future errors are likely
actual errors and will be reported as such.

Signed-off-by: Yu Qi Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. team-mco
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants