
MCO-694: revert from layered pool to non-layered pool #4284

Conversation

cheesesashimi
Member

@cheesesashimi cheesesashimi commented Mar 26, 2024

- What I did

This adds code that reverts from a layered MachineConfigPool to a non-layered MachineConfigPool.

Why this was so troublesome (illustrated by the sketch after this list):

  • When a MachineConfig is written to the node, it is placed in the portions of the filesystem that are mutable according to ostree.
  • When a container image containing those MachineConfigs is written onto the node using rpm-ostree, it technically overwrites those preexisting MachineConfigs. In doing so, the container is now claiming (for lack of a better term) ownership of those files.
  • The "factory" OS image does not contain these MachineConfigs.
  • So when we roll back from the customized image to the "factory" image, because the MachineConfig files on disk are now owned by the customized container, they are removed when the factory OS image is rebased.
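
To make the file-ownership problem concrete, here is an illustrative shell session on an ostree-based node. The file name, image references, and output are hypothetical, not taken from this PR:

```console
# A MachineConfig-written file appears as a local addition to the mutable /etc:
$ ostree admin config-diff | grep chrony.conf
A    chrony.conf

# Rebasing onto a custom image that ships the same path transfers ownership
# of the file to that image:
$ rpm-ostree rebase ostree-unverified-registry:registry.example.com/custom-os:latest

# Rebasing back onto the "factory" image, which does not contain the file,
# removes it along with the rest of the departing image's content:
$ rpm-ostree rebase ostree-unverified-registry:registry.example.com/factory-os:latest
```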

If an ad-hoc file is written to a mutable part of the filesystem after the container has been applied, provided that the container does not claim ownership of a file with the same name, the ad-hoc file will persist after a reboot. To take full advantage of this fact, this PR does the following (a sketch of the resulting node-level flow appears after the list):

  1. Introduces a new subpackage called `pkg/daemon/runtimeassets`. The purpose of this package is to house any configs or templates that need to be applied to a node during runtime but should not be part of the cluster's MachineConfigs. There is the potential for this to be used by the certificate writer path in the future.
  2. Introduces a `machine-config-daemon-revert.service` systemd service which is only rendered, written to the node, and enabled whenever a revert operation is in progress.
  3. After these files are written to the node's filesystem, the node reboots.
  4. During bootup, the new service detects the presence of `/etc/mco/machineconfig-revert.json` and runs the MCD in bootstrap mode to rewrite all of the configs to disk. This (unfortunately) requires a second node reboot.
  5. Following the second node reboot, the node should be in the reverted configuration.
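
A minimal sketch of how this flow could be observed on the node. The file and unit names come from the list above, but the commands and their output are illustrative:

```console
# The trigger file and the rendered revert unit exist only during a revert:
$ test -f /etc/mco/machineconfig-revert.json && echo "revert staged"
revert staged
$ systemctl is-enabled machine-config-daemon-revert.service
enabled

# On the next boot, the unit runs the MCD in bootstrap mode, rewrites the
# rendered configs to disk, and triggers the second reboot described above:
$ journalctl -b -u machine-config-daemon-revert.service --no-pager
```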

- How to verify it

  1. Bring up an OpenShift cluster for this PR.
  2. Opt into on-cluster builds. My onclustertesting helper can be used to assist with that; just run `$ onclustertesting setup --enable-feature-gate --pool=layered in-cluster-registry`.
  3. Wait for the image to finish building.
  4. Add a node to the layered MachineConfigPool: `$ oc label node/<nodename> 'node-role.kubernetes.io/layered='`
  5. Wait for the node to deploy the built image.
  6. Remove the label from the layered MachineConfigPool: `$ oc label node/<nodename> 'node-role.kubernetes.io/layered-'`
  7. Wait for the node to revert back to the worker MachineConfigPool (the commands after this list can help observe the revert).
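
For watching the revert from the cluster side, something like the following can help; the annotation key assumes the standard machineconfiguration.openshift.io prefix, and the output shape is illustrative:

```console
# The node should drop out of the layered pool and rejoin worker:
$ oc get mcp worker layered

# desiredConfig/currentConfig should converge on the worker pool's rendered
# config once the revert completes:
$ oc get node/<nodename> \
    -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'
```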

- Description for the changelog
Allows reverting from a layered MachineConfigPool to a non-layered MachineConfigPool.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 26, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 26, 2024

@cheesesashimi: This pull request references MCO-694 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 26, 2024
Contributor

openshift-ci bot commented Mar 26, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 26, 2024
@cgwalters
Member

One thing we could investigate is something a bit like #1190 where we avoid mutating the system's /etc and explicitly make a new bootloader entry.

@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 28, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 28, 2024
@cheesesashimi cheesesashimi force-pushed the zzlotnik/revert-to-non-layered branch from 2f6ba4a to 979f59f Compare July 10, 2024 20:43
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2024
@cheesesashimi
Member Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2024
@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

2 similar comments
@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

@cheesesashimi cheesesashimi force-pushed the zzlotnik/revert-to-non-layered branch 2 times, most recently from bf0c69d to aaad79e Compare July 15, 2024 21:07
@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

3 similar comments
@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

@cheesesashimi
Member Author

/test e2e-gcp-op
/test e2e-gcp-op-techpreview

@cheesesashimi cheesesashimi marked this pull request as ready for review July 22, 2024 16:14
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2024
@openshift-ci openshift-ci bot requested review from sinnykumari and yuqi-zhang July 22, 2024 16:15
Contributor

@yuqi-zhang yuqi-zhang left a comment


Really liking the new runtimeassets method! Thanks for adapting the PR!

/lgtm
/hold

Holding for:

  1. QE pre-merge approval
  2. removal of test code after MCO-703: Lifecycle Buildah with MCO #4471 merges

```go
// If the new OS image equals the OS image URL value, this means we're in a
// revert-from-layering situation. This also means we can return early after
// taking a different path.
if newImage == newConfig.Spec.OSImageURL {
```
Contributor

I guess I originally thought that a user can also set the image back by hand, but that should be fine here as well.

```diff
@@ -456,6 +474,9 @@ func prepareForTest(t *testing.T, cs *framework.ClientSet, testOpts onClusterBuildTestOpts
 pushSecretName, err := getBuilderPushSecretName(cs)
 require.NoError(t, err)

 // REMOVE AFTER https://github.com/openshift/machine-config-operator/pull/4471 LANDS!
```
Contributor

I think we can land that first since it's mostly ready to go

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 1, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 1, 2024
@yuqi-zhang
Contributor

Going to remove the hold now that #4471 has landed. @cheesesashimi could you rebase when you get a chance?

/hold cancel

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 28, 2024
@cheesesashimi cheesesashimi force-pushed the zzlotnik/revert-to-non-layered branch from ce47545 to 07db49e Compare August 29, 2024 20:14
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 29, 2024
@yuqi-zhang
Contributor

/lgtm
/hold

I just realized the original hold is for QE verification. I'm going to re-add that in case we missed some edge cases. Feel free to unhold if no longer necessary.

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Aug 30, 2024
Contributor

openshift-ci bot commented Aug 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, yuqi-zhang


Needs approval from an approver in each of these files:
  • OWNERS [cheesesashimi,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cheesesashimi
Member Author

The failure in e2e-gcp-op is most likely unrelated to this. Still, I'd like to get a clean run.

@djoshy
Contributor

djoshy commented Sep 10, 2024

I'm not sure if this case is handled, so I wanted to check: what happens to node annotations when reverted? Related context here.

@cheesesashimi
Member Author

When reverted, the `desiredImage` and `currentImage` annotations should be cleared.

@cheesesashimi
Member Author

/retest-required

@djoshy
Contributor

djoshy commented Sep 18, 2024

When reverted, the `desiredImage` and `currentImage` annotations should be cleared.

Just to clarify, do you mean that annotations are set to blank, or that they are completely removed on the node object?

@cheesesashimi
Member Author

Just to clarify, do you mean that annotations are set to blank, or that they are completely removed on the node object?

I mean that they are completely removed.
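
One quick way to confirm that after a revert, assuming the annotations live under the usual machineconfiguration.openshift.io prefix (this check is illustrative, not part of the PR):

```console
# Expect an empty object once desiredImage and currentImage are removed:
$ oc get node/<nodename> -o json \
    | jq '.metadata.annotations | with_entries(select(.key | test("Image")))'
{}
```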

@cheesesashimi
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1929823 and 2 for PR HEAD 07db49e in total

@cheesesashimi
Member Author

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 1929823 and 2 for PR HEAD 07db49e in total

Contributor

openshift-ci bot commented Sep 24, 2024

@cheesesashimi: all tests passed!


@openshift-merge-bot openshift-merge-bot bot merged commit 1ac641c into openshift:master Sep 24, 2024
17 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.18.0-202409250208.p0.g1ac641c.assembly.stream.el9.
All builds following this will include this PR.
