Different partition layout after re-provisioning between version 39 and 40 #1745

iwanb opened this issue Jun 7, 2024 · 8 comments

iwanb commented Jun 7, 2024

Describe the bug

Re-provisioning a bare-metal machine using the example storage config from the docs, first installed with 39.20231101.3.0 and then reinstalled with 40.20240519.3.0, leads to a different partition layout, and the re-provisioning therefore fails.

Reproduction steps

I've reproduced this in qemu here: https://github.com/iwanb/coreos-reprovision-baremetal

I haven't tried to narrow down the version where the change was introduced.

Expected behavior

Partitions stay the same when re-provisioning.

From this doc, I would expect that to be the case: https://docs.fedoraproject.org/en-US/fedora-coreos/bare-metal/#_data_persistence

I could imagine that it can be hard to provide full backward and forward compatibility for this, so an alternative would be to document how to specify the partitions more precisely, and to warn that re-provisioning could otherwise break.

Actual behavior

The partition layout differs; the var partition is smaller in the new version:

$ diff disk-before disk-after
6c6
< Disk identifier: 333AA36F-1E5A-4770-9721-3217CD45AD37
---
> Disk identifier: 8FD44CE1-F723-46C4-9585-1314E286CC33
13c13
< /dev/nbd0p5 17827840 41943006 24115167 11.5G Linux filesystem
---
> /dev/nbd0p5 17827840 41940992 24113153 11.5G Linux filesystem
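
For reference, listings like the one above can be captured with fdisk and diffed; this is just one way to produce them (the nbd device name comes from the qemu reproduction and is an assumption about how the files were generated):

# Capture the partition layout before and after reprovisioning so it can be diffed.
# /dev/nbd0 is the qemu-nbd device from the reproduction repo linked above.
fdisk -l /dev/nbd0 > disk-before
# ... reinstall with the newer image ...
fdisk -l /dev/nbd0 > disk-after
diff disk-before disk-after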

System details

Bare metal (where I first saw it) and QEMU (where I reproduced it).

Butane or Ignition config

From https://docs.fedoraproject.org/en-US/fedora-coreos/storage/#_setting_up_separate_var_mounts

variant: fcos
version: 1.5.0
storage:
  disks:
  - # The link to the block device the OS was booted from.
    device: /dev/disk/by-id/coreos-boot-disk
    # We do not want to wipe the partition table since this is the primary
    # device.
    wipe_table: false
    partitions:
    - number: 4
      label: root
      # Allocate at least 8 GiB to the rootfs. See NOTE above about this.
      size_mib: 8192
      resize: true
    - size_mib: 0
      # We assign a descriptive label to the partition. This is important
      # for referring to it in a device-agnostic way in other parts of the
      # configuration.
      label: var
  filesystems:
    - path: /var
      device: /dev/disk/by-partlabel/var
      # We can select the filesystem we'd like.
      format: ext4
      # Ask Butane to generate a mount unit for us so that this filesystem
      # gets mounted in the real root.
      with_mount_unit: true


Additional information

No response
iwanb added the kind/bug label Jun 7, 2024
@hrismarin

In the docs section you point to (Data persistence), there is a link to the Using persistent state section, which states that you can safely set wipe_partition_entry when reusing a disk, since that doesn't modify the contents of the partition. Adding wipe_partition_entry: true to your Butane config should allow you to re-provision nodes with different releases of FCOS while maintaining persistent state.

So your Butane configuration could look like this:

variant: fcos
version: 1.5.0
storage:
  disks:
  - # The link to the block device the OS was booted from.
    device: /dev/disk/by-id/coreos-boot-disk
    # We do not want to wipe the partition table since this is the primary
    # device.
    wipe_table: false
    partitions:
    - number: 4
      label: root
      # Allocate at least 8 GiB to the rootfs. See NOTE above about this.
      size_mib: 8192
      resize: true
    - size_mib: 0
      # Check if existing partition matches the specified one,
      # delete existing partition and create specified partition
      # if it does not match.
      wipe_partition_entry: true
      # We assign a descriptive label to the partition. This is important
      # for referring to it in a device-agnostic way in other parts of the
      # configuration.
      label: var
  filesystems:
    - path: /var
      device: /dev/disk/by-partlabel/var
      # We can select the filesystem we'd like.
      format: ext4
      # Ask Butane to generate a mount unit for us so that this filesystem
      # gets mounted in the real root.
      with_mount_unit: true

iwanb commented Jun 8, 2024

I tried a few things; what works is setting size_mib explicitly instead of 0. wipe_partition_entry: true did not work, nor did resize: true or removing size_mib, which I would have expected to work.

It looks like the change is that size_mib: 0 now aligns the end of the partition to the last MiB boundary of the disk instead of to the last usable sector. I guess that makes sense, since the size is in MiB.

From the QEMU diff:

< /dev/nbd0p5 17827840 41943006 24115167 11.5G Linux filesystem
---
> /dev/nbd0p5 17827840 41940992 24113153 11.5G Linux filesystem

41940992 * 512 / 1024 / 1024 = 20479

while:

41943006 * 512 / 1024 / 1024 = 20479.98...

So probably to fix my setup I would resize the partition to a MiB boundary, then set size_mib explicitly to match the size of my disk.
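
For illustration, here is a rough, back-of-the-envelope sketch of how an explicit size_mib could be chosen for the 20 GiB test disk from the qemu reproduction (the numbers are taken from the diff above, not a general recipe, and the existing filesystem would still need to be shrunk to fit first):

# Pick an explicit size_mib for /var that fits within both the f39 layout
# (last usable LBA = N-34) and the f40 layout (last usable LBA = N-2048).
start_sector=17827840   # start of /var from the diff above (an exact MiB boundary)
disk_sectors=41943040   # 20 GiB disk with 512-byte sectors
start_mib=$(( start_sector * 512 / 1024 / 1024 ))   # 8705
disk_mib=$(( disk_sectors * 512 / 1024 / 1024 ))    # 20480
# Leave the final MiB of the disk unused so the partition stays inside the
# stricter N-2048 limit on both layouts.
echo "size_mib: $(( disk_mib - start_mib - 1 ))"    # prints: size_mib: 11774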

@hrismarin

Not sure what is causing the issue, but it is reproducible. Setting size_mib explicitly works for me, but it results in a few MiB of unpartitioned space. wipe_partition_entry: true uses all available disk space. I tested the above Butane config on a bare-metal machine with one SSD attached.

jlebon commented Jul 5, 2024

Nice catch! This is, similarly to #1384, fallout from our switch from sgdisk to sfdisk to format the disk. One difference between the two is how they set the first usable and last usable LBA of the disk (these are fields in the GPT header that determine the valid range of the disk which may contain partitions).

sgdisk sets the first usable LBA to 34, which is the minimum required by the spec, and the last usable LBA to N-34 (i.e. 34 sectors before the end of the disk), which leaves just enough room for the backup GPT. sfdisk sets the first usable LBA to 2048, and the last usable LBA to N-2048.

Both sgdisk and sfdisk align partitions to 1M and will refuse to create partitions below sector 2048, but it seems like sfdisk tries to reflect this in the first LBA field itself.
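
If you want to check which layout a given disk ended up with, those header fields can be read back with sfdisk and jq (a sketch assuming a reasonably recent util-linux, the same tools as the recovery script below; the device name is an example):

# Dump the GPT header fields that differ between the two tools.
# An f39-installed disk (sgdisk) should show firstlba 34 and lastlba N-34;
# an f40-installed disk (sfdisk) should show firstlba 2048 and lastlba N-2048.
dev=/dev/vda   # example device
sfdisk --json "$dev" | jq '.partitiontable | {firstlba, lastlba}'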

Even though Ignition uses sgdisk to create new partitions, sgdisk still respects the existing last usable LBA field from the GPT header (initially written by sfdisk during disk creation) and so sets the last sector of the /var partition to N-2048.

This explains the difference you're seeing. Using your numbers for example, let's compare the sizes:

24115167 − 24113153 = 2014

So we lost 2014 sectors, which corresponds to 2048-34.

Now thankfully, this shouldn't result in data loss (assuming wipe_filesystem is off, which it should be if you're expecting to reuse an earlier formatted filesystem). But as you've seen, mounting will fail because parts of the filesystem now lie outside the partition.

Ideally we would've caught this as part of the f40 rebase and included information about it in our communications. I don't think we currently have data persistence tests, but we should.

We could still send something to coreos-status. Though AFAIK this is the first report of this. But it may simply be that people using data persistence haven't reprovisioned from newer images yet.

jlebon commented Jul 5, 2024

For anyone hitting this, you can restore the last partition to its original size using the following sfdisk script:

dev=/dev/vda # replace with target device
# Read the existing partition table as JSON and locate the last partition.
diskinfo=$(sfdisk --json $dev)
partnum=$(jq '.partitiontable.partitions | length' <<< "$diskinfo")
start_lba=$(jq '.partitiontable.partitions[-1].start' <<< "$diskinfo")
# Get the disk size and logical sector size in bytes (exported as SIZE and LOG_SEC).
eval $(lsblk -Pbdyo SIZE,LOG-SEC $dev)
size_in_lba=$((SIZE/LOG_SEC))
# Restore the sgdisk-style last usable LBA (34 sectors before the end of the disk)
# and extend the last partition back out to it.
last_lba=$((size_in_lba-34))
echo -e "last-lba:$last_lba\n$start_lba $((last_lba-start_lba+1))" | sfdisk $dev -N $partnum --no-reread
# Tell the kernel about the updated partition size.
partx --update --nr $partnum $dev

That should allow mounting the filesystem and recovering your data. For filesystems that support shrinking, it would be best to then resize the filesystem down so that it fits within the smaller partition size on the next reprovision. For filesystems that don't (e.g. XFS), you would have to reformat (though as a temporary workaround, you can run the script above on every reprovision from a unit ordered Before=local-fs.target).
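
For ext4, that shrink step could look roughly like this (a sketch only; the device name and size are examples, and the filesystem must be unmounted, e.g. from a rescue shell or the initramfs; XFS cannot be shrunk, so this doesn't apply there):

# Shrink an ext4 /var so it fits inside the smaller, MiB-aligned partition
# that the next reprovision will create. /dev/vda5 and 11774M are example values.
umount /var
e2fsck -f /dev/vda5          # resize2fs requires a forced fsck before shrinking
resize2fs /dev/vda5 11774M   # shrink to (or slightly below) the new partition size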

jlebon added the meeting label Jul 5, 2024
gursewak1997 removed the meeting label Jul 17, 2024

iwanb commented Jul 23, 2024

Is there a way to somehow reuse the partitions as-is, without recreating the partition table when reprovisioning? I have an XFS partition, so it's a bit annoying to reformat it. I suppose the script is an option.

jlebon commented Aug 29, 2024

Someone asked out-of-band:

it seems there is a possibility for unintended data corruption - right or wrong?

I think I could imagine data corruption in some scenarios, but in the "obvious" case, no.
The partition table change doesn't touch the filesystem, and the filesystem will refuse to mount in the truncated partition. The steps in #1745 (comment) can then be used to fix it, but all that does is change the table, not the filesystem.

Where I could see data corruption, though, is if on one of your reprovisionings you want to tack on another partition after the one you've been reusing (e.g. because you moved to a larger disk) and the lost blocks cross over a 1M boundary. Then the new partition would take ownership of those last few blocks, and an mkfs in that partition would corrupt things. But that would also require changing your Ignition config to go from "take all the space remaining" to "be sized exactly this much", which is already a pretty high-wire act.

jlebon commented Aug 29, 2024

Is there a way to somehow reuse the partitions as-is, without recreating the partition table when reprovisioning? I have an XFS partition, so it's a bit annoying to reformat it. I suppose the script is an option.

Not exactly for this purpose. The install is image-based and the image includes the GPT table. coreos-installer install does have e.g. --save-partlabel and --save-partindex, which seem like they could help, but I think those might error out when sanity-checking the table and seeing that the last usable LBA is before the end of the partition you're trying to save.
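
For reference, the invocation being discussed would look roughly like this (a sketch only; the device, partition label, and Ignition config path are examples, and per the above it may be rejected by the installer's partition-table sanity check):

# Ask coreos-installer to preserve the existing partition labelled "var"
# while writing the new image to the target disk.
coreos-installer install /dev/vda \
    --save-partlabel var \
    --ignition-file config.ign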
