mantle/kola: Add function to enhance upgrade stability #3938

Merged 1 commit on Dec 18, 2024

aaradhak
Member

This commit introduces the waitForUpgradeToBeStaged function to improve the stability of the kola upgrade test by reducing timeout-related failures.
The new function sets up a systemd path unit that monitors the /ostree/repo/refs/heads/ostree/1/1 directory for updates and stops wait.service once a change is detected.
Because the test now waits until later in the upgrade process, the waiting period in runFnAndWaitForRebootIntoVersion is minimized and covers only the actual reboot phase.

Author : Dusty Mabe [email protected]
Ref: coreos/fedora-coreos-tracker#1805
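
A minimal sketch of the two commands the description implies, built as plain strings so it runs without kola. The helper names `pathUnitCommand` and `blockingWaitCommand` are hypothetical, and the second command (a blocking `systemd-run --wait` on a `sleep infinity` unit) is an assumption about how wait.service is created; only the first command appears in this PR.

```go
package main

import "fmt"

// pathUnitCommand returns a systemd-run invocation that creates a transient
// path unit named unit. When the watched path changes, systemd starts the
// paired transient service, which runs `systemctl stop` on waitUnit.
func pathUnitCommand(unit, watched, waitUnit string) string {
	return fmt.Sprintf(
		"sudo systemd-run -u %s --path-property=PathChanged=%s systemctl stop %s",
		unit, watched, waitUnit)
}

// blockingWaitCommand is an assumption, not quoted from the PR: it starts a
// transient service that sleeps forever and blocks until that service ends,
// so the caller returns only after the path unit has stopped it.
func blockingWaitCommand(waitUnit string) string {
	return fmt.Sprintf("sudo systemd-run --wait -u %s sleep infinity", waitUnit)
}

func main() {
	// The watched path is a parameter; this one is taken from the description.
	fmt.Println(pathUnitCommand("refchanged", "/ostree/repo/refs/heads/ostree/1/1", "wait.service"))
	fmt.Println(blockingWaitCommand("wait"))
}
```

In the test itself, each string would presumably be passed to c.RunCmdSync in this order, so that the path unit is armed before the blocking wait starts.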

//
// Note: if systemd-run ever gains the ability to --wait when
// generating a path unit then the below can be simplified.
c.RunCmdSync(m, "sudo systemd-run -u refchanged --path-property=PathChanged=/ostree/repo/refs/heads/ostree/1/1 systemctl stop wait.service")
Member

The canonical API for this is /ostree/deploy, which ostree touches whenever it changes the deployments, precisely so that other tools can monitor for changes:

https://github.com/ostreedev/ostree/blob/ab8a7f7855b0e0a7f3fe7214b77521268b994ce4/src/libostree/ostree-sysroot.c#L449

Member

+1

So we just need to change this to --path-property=PathChanged=/ostree/deploy and update the comment accordingly?

Member

Yeah, though I wonder if it's even necessary. Once the machine goes down for a reboot, it'll stop the wait unit below, so you should already get the desired effect. I guess RunCmdSync might mark that as an error though depending on how systemd-run exits.

Member

right.. I think once the reboot actually starts it's hard to guarantee any info makes it back outside the VM (i.e. relying on network to still be up).

Member Author

Updated the path to /ostree/deploy as suggested.

@@ -328,6 +349,7 @@ func rpmostreeRebase(c cluster.TestCluster, m platform.Machine, ref, version str
// we use systemd-run here so that we can test the --reboot path
// without having SSH not exit cleanly, which would cause an error
c.RunCmdSyncf(m, "sudo systemd-run rpm-ostree rebase --reboot %s", ref)
waitForUpgradeToBeStaged(c, m)
Member

I'm not sure this makes sense here. In this case we're running rpm-ostree rebase synchronously. It'll have already done the deployment (and initiated a reboot) by the time you get to this line.

Member

Not exactly. systemd-run is sending rpm-ostree off on its own little boat to finish independently, IIUC.

@dustymabe
Member

@aaradhak @jlebon - can we follow up on this one?

@aaradhak aaradhak requested a review from dustymabe December 13, 2024 16:13
@aaradhak
Member Author

I have made the suggested change to --path-property=PathChanged=/ostree/deploy.

dustymabe
dustymabe previously approved these changes Dec 18, 2024
Member

@dustymabe dustymabe left a comment

LGTM

@dustymabe
Member

CI is failing because you need to run go fmt on the code.

@dustymabe
Member

need to squash the commits down into one

Member

@dustymabe dustymabe left a comment

LGTM

@aaradhak aaradhak merged commit 242a88e into coreos:main Dec 18, 2024
5 checks passed