Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle re-invocations of coreos-installer on a new disk #976

Closed
cgwalters opened this issue Sep 27, 2021 · 28 comments
Closed

Handle re-invocations of coreos-installer on a new disk #976

cgwalters opened this issue Sep 27, 2021 · 28 comments
Assignees
Labels
jira for syncing to jira kind/enhancement

Comments

@cgwalters
Copy link
Member

There's an old discussion on this I'm not finding but:

We've seen issues in the past with both FCOS (I believe) and RHCOS (for sure) where people had done an install on one disk (including a /boot partition) and then re-installed on a separate disk.

Because grub picks /boot by label and the OS picks /boot, we can end up racing/flapping between picking a /boot partition on startup.

We generate a new UUID for /boot on firstboot, but I think we should instead:

  • generate a new UUID at coreos-installer time
  • teach coreos-installer to edit /boot/grub.cfg to use it
  • Also inject boot=UUID= into the kernel cmdline

This is basically what Anaconda does.

@lucab
Copy link
Contributor

lucab commented Sep 27, 2021

I'm not sure this can effectively co-exist with the multipath logic, and it also splits the boot handling logic in "coreos-installer provisioned nodes" vs everything else. Additionally, we do have other initrd units depending on the "boot" label.

Do we have an alternative way of pinning the boot partition on first boot, but still retaining auto-discovery capabilities? Alternatively, should we instead fail the boot early on if we detect ambiguities in partitions/labels?

@jlebon
Copy link
Member

jlebon commented Sep 27, 2021

I think also it doesn't really help the case where the stale boot device still has precedence in the boot order. Ideally, we would want to catch this, which would involve dynamic code.

Alternatively, should we instead fail the boot early on if we detect ambiguities in partitions/labels?

That was going to be my suggestion as well. Our stance so far has been that we own boot and other special filesystem labels at boot time. It'd also require special-casing multipath, though that should be fine (by the time we care about the boot partition, multipath should've already created multipathed devices, so we can filter out multipath-owned subdevices).

Realistically this will mostly be a bare metal thing, so it could make sense to also have coreos-installer look for this case and print out a clear warning to not waste users' time in debugging this (we could e.g. suggest the wipefs command to run so that they can just run it themselves and then reboot into the new install).

@cgwalters
Copy link
Member Author

cgwalters commented Sep 27, 2021

There's another possibility here, which is that we change the default (i.e. cosa-generated) disk images to use a unique-per-build (could be random to start, but maybe hashed from the build version number or so) UUID in both the grub config and the kernel cmdline.

If we did that then we'd need to change our firstboot uuid logic to either stop doing /boot, or update things again.

This would at least mitigate things for the case of doing an install at version X, then an install of version Y != X on a separate disk.

Realistically this will mostly be a bare metal thing, so it could make sense to also have coreos-installer look for this case and print out a clear warning to not waste users' time in debugging this

Yeah, that would also help a lot I think.

@cgwalters
Copy link
Member Author

Though of course this whole /boot issue mirrors all the discussions in #946 which mainly focused on the EFI system partition.

@lucab
Copy link
Contributor

lucab commented Sep 29, 2021

One more data point from one of the related RHCOS cases, (it's still unclear how but) the two disks ended up in this situation:

$ grep boot blkid-output 
/dev/sdc1: LABEL="boot" UUID="96d15588-3596-4b3c-adca-a2ff7279ea63" TYPE="ext4" PARTLABEL="boot" PARTUUID="7ff5e1fe-4eb4-479b-b13d-bdfaf7251fa4"
/dev/sdb1: LABEL="boot" UUID="96d15588-3596-4b3c-adca-a2ff7279ea63" TYPE="ext4" PARTLABEL="boot" PARTUUID="7ff5e1fe-4eb4-479b-b13d-bdfaf7251fa4"

@jlebon
Copy link
Member

jlebon commented Sep 29, 2021

One more data point from one of the related RHCOS cases, (it's still unclear how but) the two disks ended up in this situation:

$ grep boot blkid-output 
/dev/sdc1: LABEL="boot" UUID="96d15588-3596-4b3c-adca-a2ff7279ea63" TYPE="ext4" PARTLABEL="boot" PARTUUID="7ff5e1fe-4eb4-479b-b13d-bdfaf7251fa4"
/dev/sdb1: LABEL="boot" UUID="96d15588-3596-4b3c-adca-a2ff7279ea63" TYPE="ext4" PARTLABEL="boot" PARTUUID="7ff5e1fe-4eb4-479b-b13d-bdfaf7251fa4"

One easy way this can happen is if it's actually multipath, but multipathd isn't configured. Or was multipath already ruled out in that case?

@dustymabe
Copy link
Member

We discussed this in the community meeting today.

We did not reach any clear decision, but there was lots of great technical discussion (see log). Towards the end we were oscillating around the proposal of:

13:37:51*   dustymabe | #proposal attempt to find a good way to show the user an error and fail the boot
                      | rather than change our UUID design for this particular invalid setup.

Will continue the discussion in this ticket and also in coming meetings if necesssary.

nikita-dubrovskii added a commit to nikita-dubrovskii/coreos-installer that referenced this issue Sep 30, 2021
nikita-dubrovskii added a commit to nikita-dubrovskii/fedora-coreos-config that referenced this issue Sep 30, 2021
nikita-dubrovskii added a commit to nikita-dubrovskii/fedora-coreos-config that referenced this issue Sep 30, 2021
nikita-dubrovskii added a commit to nikita-dubrovskii/fedora-coreos-config that referenced this issue Sep 30, 2021
@jlebon
Copy link
Member

jlebon commented Sep 30, 2021

Was chatting about this with Luca in IRC. One idea here is to have coreos-boot-edit.service (which already runs rdcore today) also do the checking that there is only one boot partition. This service runs late in the boot process, so by then ideally all boot devices should've showed up.

If there's only one boot partition, then "claim" it by adding the rootfs UUID in e.g. /boot/.rootfs or so. If there's already a stamp file there, then we know it was already claimed and also error out.

But I think we also need to be careful about the real root mounting boot. In the initramfs, we don't touch boot at all after the first boot, and it's always possible that a new device is plugged in after first boot. One thing we can easily do there is move the boot.mount generation from the real root to the initramfs, as part of coreos-boot-edit, and we use the actual boot UUID symlink in the mount unit.

nikita-dubrovskii added a commit to nikita-dubrovskii/coreos-installer that referenced this issue Oct 1, 2021
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Oct 1, 2021
This makes the mount more resistant to other filesystems labeled `boot`
that may get plugged in at any point.

Part of: coreos/fedora-coreos-tracker#976
@jlebon
Copy link
Member

jlebon commented Oct 1, 2021

coreos/fedora-coreos-config#1256 handles the mount by UUID bit in the real root. I think this makes sense to do regardless of whether we do the stamp file bit proposed above.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Oct 1, 2021
This is analogous to `systemd-fstab-generator` parsing `root=` kargs.
There is precedence for this in the FIPS module. In fact, we lift code
from there to make sure we're API compatible.

The goal of supporting a `boot` karg is to eliminate possible races for
the `by-label/boot` symlink in the real root if multiple exist.

Part of: coreos/fedora-coreos-tracker#976
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Oct 1, 2021
This ensures that the rootfs will always mount the correct boot
filesystem in the future (see previous patch).

Part of: coreos/fedora-coreos-tracker#976
@jlebon
Copy link
Member

jlebon commented Oct 1, 2021

coreos/fedora-coreos-config#1256 changes the strategy now to add a boot=UUID= karg for subsequent boots (see discussions), just like we do with root=UUID=. This squashes boot races for the real root and binds the bootfs to the rootfs.

With the stamp file idea above, it makes the relationship 1:1 by preventing any other rootfs from binding to the same bootfs.

nikita-dubrovskii added a commit to nikita-dubrovskii/coreos-installer that referenced this issue Oct 4, 2021
nikita-dubrovskii added a commit to nikita-dubrovskii/coreos-installer that referenced this issue Oct 4, 2021
jlebon added a commit to coreos/fedora-coreos-config that referenced this issue Jan 27, 2022
This binds the bootloader to the bootfs and the bootfs to the rootfs.

Part of coreos/fedora-coreos-tracker#976.
@jlebon
Copy link
Member

jlebon commented Jan 28, 2022

Just to be clear, I'm referring to partition type guidnot unique partition guid.

Yup, sorry I should've used --type-uuid in my comment to make that clearer.

Instead of --fs-uuid it would be something like --type-uuid, and always look for bc13c2ff-59e6-4262-a352-b275fd6f7172 or whatever is decided on.

Ahh OK, so we're talking about something that hasn't landed yet in GRUB.

Also, is the rpm-ostree /boot fundamentally the same thing as XBOOTLDR, and should use that partition type GUID? Or is it different enough, such that it gets its own partition type GUID? There's arguments both ways, right? That's the dilemma. How strict does an implementation have to follow Boot Loader Spec to use the partition type GUID defined in the spec?

Right yeah. It mostly matches, but strictly following the BLS means it should be FAT IIUC, which it isn't. Probably worth discussing this part in #1038 instead.

@jlebon
Copy link
Member

jlebon commented Jan 28, 2022

OK, checklist for rolling this out:

Then I think we can close this.

@cmurf
Copy link

cmurf commented Jan 28, 2022

Ahh OK, so we're talking about something that hasn't landed yet in GRUB.

I'm not sure if it's been asked for by anyone until now?
https://lists.gnu.org/archive/html/grub-devel/2022-01/msg00171.html

cgwalters pushed a commit to cgwalters/fedora-coreos-config that referenced this issue Feb 10, 2022
This binds the bootloader to the bootfs and the bootfs to the rootfs.

Part of coreos/fedora-coreos-tracker#976.
nikita-dubrovskii added a commit to nikita-dubrovskii/fedora-coreos-config that referenced this issue Feb 11, 2022
Aborts firstboot when system has several filesystems labeled `boot`

Fix for coreos/fedora-coreos-tracker#976

Signed-off-by: Nikita Dubrovskii <[email protected]>
jlebon pushed a commit to coreos/fedora-coreos-config that referenced this issue Feb 14, 2022
Aborts firstboot when system has several filesystems labeled `boot`

Fix for coreos/fedora-coreos-tracker#976

Signed-off-by: Nikita Dubrovskii <[email protected]>
@jlebon
Copy link
Member

jlebon commented Feb 14, 2022

This is done now!

@jlebon jlebon closed this as completed Feb 14, 2022
@dustymabe
Copy link
Member

What's the first release of FCOS we can try this in? Can we add some docs?

@dustymabe dustymabe added the status/pending-testing-release Fixed upstream. Waiting on a testing release. label Feb 14, 2022
@jlebon
Copy link
Member

jlebon commented Feb 14, 2022

What's the first release of FCOS we can try this in?

The one with coreos/fedora-coreos-config#1269, which is part of this week's testing release.

Can we add some docs?

coreos/fedora-coreos-docs#359

@dustymabe
Copy link
Member

The fix for this went into testing stream release 35.20220213.2.0. Please try out the new release and report issues.

@dustymabe dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-testing-release Fixed upstream. Waiting on a testing release. labels Feb 16, 2022
@travier
Copy link
Member

travier commented Mar 16, 2022

I think this one went into stable 35.20220213.3.0. @dustymabe?

@dustymabe
Copy link
Member

The fix for this went into stable stream release 35.20220213.3.0.

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Mar 16, 2022
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This is analogous to `systemd-fstab-generator` parsing `root=` kargs.
There is a precedent for this in the FIPS module. In fact, we lift code
from there to make sure we're API compatible.

The goal of supporting a `boot` karg is to eliminate possible races for
the `by-label/boot` symlink in the real root if multiple exist.

Part of: coreos/fedora-coreos-tracker#976
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This ensures that the rootfs will always mount the correct boot
filesystem in the future (see previous patch).

Part of: coreos/fedora-coreos-tracker#976
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This binds the bootloader to the bootfs and the bootfs to the rootfs.

Part of coreos/fedora-coreos-tracker#976.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Aborts firstboot when system has several filesystems labeled `boot`

Fix for coreos/fedora-coreos-tracker#976

Signed-off-by: Nikita Dubrovskii <[email protected]>
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This is analogous to `systemd-fstab-generator` parsing `root=` kargs.
There is a precedent for this in the FIPS module. In fact, we lift code
from there to make sure we're API compatible.

The goal of supporting a `boot` karg is to eliminate possible races for
the `by-label/boot` symlink in the real root if multiple exist.

Part of: coreos/fedora-coreos-tracker#976
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This ensures that the rootfs will always mount the correct boot
filesystem in the future (see previous patch).

Part of: coreos/fedora-coreos-tracker#976
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This binds the bootloader to the bootfs and the bootfs to the rootfs.

Part of coreos/fedora-coreos-tracker#976.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Aborts firstboot when system has several filesystems labeled `boot`

Fix for coreos/fedora-coreos-tracker#976

Signed-off-by: Nikita Dubrovskii <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira for syncing to jira kind/enhancement
Projects
None yet
Development

No branches or pull requests

8 participants