Bug 1885365: daemon: properly handle unit enable/disables #2145

yuqi-zhang · 2020-10-07T23:14:00Z

Remove the hardlink to multi-user.target and instead invoke systemctl
to directly enable/disable units as needed, so [Install] - wantedby
sections are properly parsed.

Also call systemctl preset on units that are deleted/changed, to
ensure "rollback" of unit enablement to system defaults. Remove
the Ignition-written preset file to account for this.

openshift-ci-robot · 2020-10-07T23:14:05Z

@yuqi-zhang: This pull request references Bugzilla bug 1885365, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.7.0) matches configured target release for branch (4.7.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1885365: daemon: properly handle unit enable/disables

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang · 2020-10-08T03:31:13Z

need to fix up a bit more but general idea is up for review

bgilbert

LGTM generally.

pkg/daemon/daemon.go

yuqi-zhang · 2020-10-08T20:09:41Z

Tested with the following:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 98-test-service
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: test.service
          contents: |
            [Unit]
            Description=Hello Test Service
            After=network-online.target
            [Service]
            Type=simple
            ExecStart=/usr/bin/echo hello
            [Install]
            WantedBy=network-online.target

and

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 98-test-service-2
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: test.service
          enabled: true

Adding the first correctly adds the service
Adding the second correctly enables and runs the service upon reboot, linking to network-online.service
Removing the second correctly disables the service

Also tested without this PR, the above would link to multi-user.target.wants instead (incorrect) and does not get disabled when you remove the second MC (incorrect)

yuqi-zhang · 2020-10-08T21:25:06Z

/skip

openshift-ci-robot · 2020-10-08T21:50:42Z

@yuqi-zhang: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

/test cluster-bootimages
/test e2e-aws
/test e2e-aws-disruptive
/test e2e-aws-proxy
/test e2e-aws-workers-rhel7
/test e2e-azure
/test e2e-gcp-op
/test e2e-metal-ipi
/test e2e-openstack
/test e2e-ovirt
/test e2e-ovn-step-registry
/test e2e-upgrade
/test e2e-vsphere
/test e2e-vsphere-upi
/test images
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-images
/test unit
/test verify

Use /test all to run the following jobs:

pull-ci-openshift-machine-config-operator-master-e2e-aws
pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
pull-ci-openshift-machine-config-operator-master-e2e-upgrade
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-e2e-aws
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ci/prow/e2e-gcp-op

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang · 2020-10-08T21:50:47Z

/test e2e-gcp-op

pkg/daemon/update.go

yuqi-zhang · 2020-10-09T14:20:15Z

So there's a small issue with this approach: there appears to be roughly a ~35% increase in operation time due to the extra writes and calls to systemctl. At the very least this times out our GCP-OP tests.

Our options then seem to be:

incur the performance penalty and bump test timeouts
go back to writing the files by hand, and add a parser ourselves to find the install section of the unit, when its enabled, and link based on that instead of hardcoding links

cgwalters · 2020-10-09T14:23:22Z

Today the MCO rewrites files even if they didn't change; if we fixed that it'd be a general improvement and also fix that bug I think.

yuqi-zhang · 2020-10-09T14:46:34Z

Today the MCO rewrites files even if they didn't change; if we fixed that it'd be a general improvement and also fix that bug I think.

I was under the impression that this was somewhat intentional. Although inefficient, this has some benefits e.g. allowing us to use the forcefile option to re-write all the configs, in case of a drift-degraded situation.

We could consider that separately. I also don't see how it would fix this issue, as the fundamental problem is a wrong hardcode + no restoration of system defaults

bgilbert · 2020-10-09T15:34:30Z

Hand-implementing [Install] would require handling or intentionally ignoring various corner cases: Alias, Also, and DefaultInstance for templated units. I wouldn't recommend it.

systemctl {enable,disable,preset} support specifying multiple units on the command line. You could accumulate three lists of units to configure, and then at the end, run systemctl exactly once per category of change.

pkg/daemon/update.go

bgilbert

LGTM

dcbw · 2020-12-02T20:02:04Z

/retest

Remove the hardlink to multi-user.target and instead invoke systemctl to directly enable/disable units as needed, so [Install] - wantedby sections are properly parsed. Also call systemctl preset on units that are deleted/changed, to ensure "rollback" of unit enablement to system defaults. Remove the Ignition-written preset file to account for this. Signed-off-by: Yu Qi Zhang <[email protected]>

With the introduction of shelling out to systemctl, the MCO e2e tests incur a ~35% penalty to speed. Try to mitigate this by collecting all enabled/disabled units and writing them in one call. Don't do this for presets, since we don't want to fail for any individual preset command failing. Signed-off-by: Yu Qi Zhang <[email protected]>

yuqi-zhang · 2020-12-08T22:22:47Z

Rebased, will see whether slowdowns are still seen in gcp-op tests

sinnykumari · 2020-12-09T10:45:12Z

/retest

yuqi-zhang · 2020-12-09T20:16:33Z

Comparing runtimes of most recent gcp-op tests from other PRs, I don't see a significant difference in runtime for the tests that passed. Will retest a few more times after gcp-op issues are fixed.

yuqi-zhang · 2020-12-09T23:42:17Z

Tested again with the steps in #2145 (comment). Once we're confident in GCP-OP not being slowed down by this, I'm ready to have this merge

yuqi-zhang · 2020-12-10T20:15:16Z

/retest

yuqi-zhang · 2020-12-10T22:18:55Z

/retest

yuqi-zhang · 2020-12-14T17:00:44Z

/retest

yuqi-zhang · 2020-12-14T17:51:46Z

/retest

yuqi-zhang · 2020-12-14T18:43:33Z

/retest

darkmuggle · 2020-12-14T18:58:59Z

/approve
/lgtm

Reading over the PR and the code, this LGTM.

openshift-ci-robot · 2020-12-14T18:59:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bgilbert, darkmuggle, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

yuqi-zhang · 2020-12-14T21:51:48Z

One more test to be safe. I seem some minor slowdowns but I think they are unrelated to this PR. Overall seems to be matching expectations

/test ci/prow/e2e-gcp-op

openshift-ci-robot · 2020-12-14T21:52:07Z

@yuqi-zhang: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

/test cluster-bootimages
/test e2e-agnostic-upgrade
/test e2e-aws
/test e2e-aws-disruptive
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-workers-rhel7
/test e2e-azure
/test e2e-gcp-op
/test e2e-metal-assisted
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-openstack
/test e2e-ovirt
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upi
/test images
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-images
/test unit
/test verify

Use /test all to run the following jobs:

pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-aws
pull-ci-openshift-machine-config-operator-master-e2e-aws-serial
pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-e2e-aws
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

One more test to be safe

/test ci/prow/e2e-gcp-op

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang · 2020-12-14T21:52:25Z

/test e2e-gcp-op

yuqi-zhang · 2020-12-15T04:17:31Z

Times look fine
/hold cancel

openshift-bot · 2020-12-15T10:14:05Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2020-12-15T12:04:28Z

@yuqi-zhang: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#2145

Bugzilla bug 1885365 has been moved to the MODIFIED state.

In response to this:

Bug 1885365: daemon: properly handle unit enable/disables

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

If unit enable fails, remove broken symlinks in multi-user.target.wants and try again. This fixes a bug where enables would fail on cluster upgrades with RHEL 7 nodes between 4.6 -> 4.7. Context: before openshift#2145, the MCO hard coded a symlink from /etc/system/systemd/$UNIT to /etc/systemd/system/multi-user.target.wants/$UNIT, which is not the case for every unit and thus caused broken symlinks. On RHCOS/FCOS, the systemd version is newer and is able to remove broken symlinks, but on RHEL 7 nodes, it will not first attempt to remove broken symlinks and thus fails the enable. As a workaround, this PR thus attempts to remove broken symlinks when the first enable fails, and then try again. Successful FCOS/RHCOS upgrades should not hit this, and failing ones would report full errors. The error checking is perhaps a bit overkill but the original bug case should only run through this logic once before it is fixed. Future errors are likely actual errors and will be reported as such. Signed-off-by: Yu Qi Zhang <[email protected]>

openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 7, 2020

yuqi-zhang mentioned this pull request Oct 7, 2020

WIP: Bug 1885365: update.go: use systemd presets to enable/disable units #2142

Closed

openshift-ci-robot requested review from cgwalters and sinnykumari October 7, 2020 23:14

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2020

yuqi-zhang requested a review from bgilbert October 7, 2020 23:14

bgilbert reviewed Oct 8, 2020

View reviewed changes

pkg/daemon/daemon.go Outdated Show resolved Hide resolved

vrutkovs mentioned this pull request Oct 8, 2020

[Tracking] OKD 4.6 Stable okd-project/okd#348

Closed

6 tasks

yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch 4 times, most recently from 8ca3f00 to e27e04e Compare October 8, 2020 19:38

bgilbert reviewed Oct 8, 2020

View reviewed changes

pkg/daemon/update.go Outdated Show resolved Hide resolved

yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch from e27e04e to b22cbf3 Compare October 9, 2020 14:36

sinnykumari mentioned this pull request Oct 12, 2020

MCO: add OKD tests for release-4.6 and release-4.7 branches openshift/release#12683

Merged

bgilbert reviewed Oct 13, 2020

View reviewed changes

pkg/daemon/update.go Outdated Show resolved Hide resolved

pkg/daemon/update.go Show resolved Hide resolved

yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch 2 times, most recently from 6b526da to 78b0fd3 Compare October 13, 2020 22:06

bgilbert approved these changes Oct 13, 2020

View reviewed changes

yuqi-zhang added 2 commits December 8, 2020 17:16

yuqi-zhang force-pushed the fix-systemd-unit-enable-disable branch from e16110a to b9e06c3 Compare December 8, 2020 22:22

cgwalters mentioned this pull request Dec 9, 2020

likely oscontainer regression coreos/coreos-assembler#1961

Closed

openshift-ci-robot assigned darkmuggle Dec 14, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2020

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2020

openshift-merge-robot merged commit 7b683e3 into openshift:master Dec 15, 2020

YuviGold mentioned this pull request Dec 16, 2020

Support MCD running systemctl commands openshift/assisted-installer#145

Closed

yuqi-zhang mentioned this pull request Jan 13, 2021

Bug 1913536: update.go: add broken symlink check + removal during unit enable #2338

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1885365: daemon: properly handle unit enable/disables #2145

Bug 1885365: daemon: properly handle unit enable/disables #2145

yuqi-zhang commented Oct 7, 2020

openshift-ci-robot commented Oct 7, 2020

yuqi-zhang commented Oct 8, 2020

bgilbert left a comment

yuqi-zhang commented Oct 8, 2020 •

edited

Loading

yuqi-zhang commented Oct 8, 2020

openshift-ci-robot commented Oct 8, 2020

yuqi-zhang commented Oct 8, 2020

yuqi-zhang commented Oct 9, 2020

cgwalters commented Oct 9, 2020

yuqi-zhang commented Oct 9, 2020

bgilbert commented Oct 9, 2020

bgilbert left a comment

dcbw commented Dec 2, 2020

yuqi-zhang commented Dec 8, 2020

sinnykumari commented Dec 9, 2020

yuqi-zhang commented Dec 9, 2020

yuqi-zhang commented Dec 9, 2020

yuqi-zhang commented Dec 10, 2020

yuqi-zhang commented Dec 10, 2020

yuqi-zhang commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020

darkmuggle commented Dec 14, 2020

openshift-ci-robot commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020 •

edited

Loading

openshift-ci-robot commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020

yuqi-zhang commented Dec 15, 2020

openshift-bot commented Dec 15, 2020

openshift-ci-robot commented Dec 15, 2020

Bug 1885365: daemon: properly handle unit enable/disables #2145

Bug 1885365: daemon: properly handle unit enable/disables #2145

Conversation

yuqi-zhang commented Oct 7, 2020

openshift-ci-robot commented Oct 7, 2020

yuqi-zhang commented Oct 8, 2020

bgilbert left a comment

Choose a reason for hiding this comment

yuqi-zhang commented Oct 8, 2020 • edited Loading

yuqi-zhang commented Oct 8, 2020

openshift-ci-robot commented Oct 8, 2020

yuqi-zhang commented Oct 8, 2020

yuqi-zhang commented Oct 9, 2020

cgwalters commented Oct 9, 2020

yuqi-zhang commented Oct 9, 2020

bgilbert commented Oct 9, 2020

bgilbert left a comment

Choose a reason for hiding this comment

dcbw commented Dec 2, 2020

yuqi-zhang commented Dec 8, 2020

sinnykumari commented Dec 9, 2020

yuqi-zhang commented Dec 9, 2020

yuqi-zhang commented Dec 9, 2020

yuqi-zhang commented Dec 10, 2020

yuqi-zhang commented Dec 10, 2020

yuqi-zhang commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020

darkmuggle commented Dec 14, 2020

openshift-ci-robot commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020 • edited Loading

openshift-ci-robot commented Dec 14, 2020

yuqi-zhang commented Dec 14, 2020

yuqi-zhang commented Dec 15, 2020

openshift-bot commented Dec 15, 2020

openshift-ci-robot commented Dec 15, 2020

yuqi-zhang commented Oct 8, 2020 •

edited

Loading

yuqi-zhang commented Dec 14, 2020 •

edited

Loading