Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for ppc64le boot mirroring tests #3611

Merged
merged 2 commits into from
Sep 14, 2023

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Sep 13, 2023

These two commits together should fix both #2725 and #3360.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Sep 13, 2023
jlebon added a commit to jlebon/os that referenced this pull request Sep 13, 2023
mantle/platform/qemu.go Outdated Show resolved Hide resolved
@@ -100,6 +100,11 @@ func runBootMirrorTest(c cluster.TestCluster) {
MinMemory: 4096,
},
}
// ppc64le and aarch64 use 64K pages
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. does aarch64 really use 64K pages? We recently had a 64K page bug and it really only affected ppc64le.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. For reference, I cargo-culted this block from other places in the code for the ppc64le bit but reading the comment it's true that its mention of aarch64 using 64K pages is incorrect (hence why we're doing things like adding support for a separate aarch64 kernel with the feature).

It looks like the first comment mentioning this was introduced in #1955. The comment may not be accurate but it seems like Prashanth had to bump the memory on aarch64 too to get tests to pass.

For now, I've cross-referenced all these occurrences, but let's look into ripping them out again to see which ones are load-bearing and only keep those (with an up-to-date rationale).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aarch64 used 64k pages by default with RHEL 8. RHEL 9 defaulted to 4K pages, that's why we're introducing #3903
Maybe that's why it was there ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not super comfortable with the comment as is because it can be misleading. i.e. someone reads the comment and accepts it as fact that aarch64 uses 64K pages.. how about we add a (sometimes) after aarch64:

// ppc64le and aarch64 (sometimes) use 64K pages

Maybe we even add the (sometimes) in a second commit so the commit message can have more context.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aarch64 used 64k pages by default with RHEL 8. RHEL 9 defaulted to 4K pages, that's why we're introducing #3903 Maybe that's why it was there ?

Nice find! Yes, it all makes sense now.

Maybe we even add the (sometimes) in a second commit so the commit message can have more context.

Hmm, I think we should actually drop the aarch64 special-casing entirely. I'm working on a patch right now and will test it. Assuming nothing breaks, I'll push it as another commit here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I ended up splitting this out into a separate PR to not waste CI here: #3617.

On ppc64le and s390x, the Ignition config is passed as a block device.
In the `RemovePrimaryBlockDevice()` code, we need to make sure we don't
inadvertently select the `ignition` block device as the new boot device.

More generally, the reliance on the `virtio-backend` string is weak.
Instead, use a more specific device name for the real block devices when
preparing the QEMU arguments, and filter on that.

Partially fixes: coreos#2725
Partially fixes: coreos#3360
All our other root reprovisioning tests double the memory request on
ppc64le and aarch64 due to the larger page size. Do this for the boot
mirroring tests too and increase the memory request to 8G. With the
current 4G, the tests would sometimes panic during the reboot right
after the primary block device detach.

Even with 8G, the panic still happens, albeit much more rarely. Rather
than bumping the memory even more, I've found that sleeping a bit before
rebooting does the trick.

Partially fixes: coreos#2725
Partially fixes: coreos#3360
Copy link
Member

@dustymabe dustymabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dustymabe dustymabe merged commit b1fff53 into coreos:main Sep 14, 2023
2 checks passed
@dustymabe
Copy link
Member

I reverted the tag in quay for COSA that included this PR. We're seeing failures in multi-arch builds like:

[2023-09-14T21:15:10.883Z] FAIL: iso-offline-install-fromram.uefi (30.136s)
[2023-09-14T21:15:10.883Z]     switching boot order failed: Could not find target disk using QMP.
[2023-09-14T21:15:10.883Z] Full list of block devices: [{disk-1 /machine/peripheral-anon/device[4]/virtio-backend false {0 #block122}} {pflash0 /machine/virt.flash0 false {0 #block379}} {pflash1 /machine/virt.flash1 false {0 #block502}} {installiso /machine/peripheral-anon/device[8]/virtio-backend false {0 #block711}}].

We'll need to investigate in the morning.

@jlebon
Copy link
Member Author

jlebon commented Sep 15, 2023

I reverted the tag in quay for COSA that included this PR. We're seeing failures in multi-arch builds like:

[2023-09-14T21:15:10.883Z] FAIL: iso-offline-install-fromram.uefi (30.136s)
[2023-09-14T21:15:10.883Z]     switching boot order failed: Could not find target disk using QMP.
[2023-09-14T21:15:10.883Z] Full list of block devices: [{disk-1 /machine/peripheral-anon/device[4]/virtio-backend false {0 #block122}} {pflash0 /machine/virt.flash0 false {0 #block379}} {pflash1 /machine/virt.flash1 false {0 #block502}} {installiso /machine/peripheral-anon/device[8]/virtio-backend false {0 #block711}}].

We'll need to investigate in the morning.

Fix in #3620.

dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this pull request Sep 15, 2023
jlebon added a commit to jlebon/coreos-assembler that referenced this pull request Sep 20, 2023
Follow-up to coreos#3611.

We're still noticing occasional kernel panics with 30s on ppc64le.
Bump it to 60s to see if that helps as a last ditch effort before re-
disabling these tests until someone has time to dig into it properly.

While we're here, make it specific to ppc64le.
dustymabe pushed a commit that referenced this pull request Sep 20, 2023
Follow-up to #3611.

We're still noticing occasional kernel panics with 30s on ppc64le.
Bump it to 60s to see if that helps as a last ditch effort before re-
disabling these tests until someone has time to dig into it properly.

While we're here, make it specific to ppc64le.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants