Ignition fails to wipe disk if partition table is corrupted #1596

edcdavid · 2023-10-18T14:14:56Z

Bug coreos/ignition#1728

Operating System Version

Fedora CoreOS 38 (38.20230722.3.0)

Ignition Version

Ignition 2.16.2

Environment

What hardware/cloud provider/hypervisor is being used to run Ignition?
CoreOS running on libvirt VM

Expected Behavior

When "wipe_table: true" is specified the disk device should be wiped even if its partition table is corrupted

Actual Behavior

Ignition fails to wipe disk with sgdisk error:
[ 35.521754] ignition[855]: disks: createPartitions: op(2): op(3): [failed] deleting 0 partitions and creating 0 partitions on "/run/ignition/dev_aliases/dev/vdb": exit status 2: Cmd: "sgdisk" "--zap-all" "/run/ignition/dev_aliases/dev/vdb" Stdout: "GPT data structures destroyed! You may now partition the disk using fdisk or\nother utilities.\n" Stderr: "\aCaution: invalid main GPT header, but valid backup; regenerating main header\nfrom backup!\n\n\aWarning: Invalid CRC on main header data; loaded backup partition table.\nWarning! Main and backup partition tables differ! Use the 'c' and 'e' options\non the recovery & transformation menu to examine the two tables.\n\n\aWarning! Main partition table CRC mismatch! Loaded backup partition table\ninstead of main partition table!\n\nWarning! One or more CRCs don't match. You should repair the disk!\nMain header: ERROR\nBackup header: OK\nMain partition table: ERROR\nBackup partition table: OK\n\nInvalid partition data!\n"

Reproduction Steps

Create a disk with corrupted partition table, for instance:
sgdisk --new=2:0:0 -i=2 /dev/vdb
dd if=/dev/urandom of=/dev/vdb bs=512 count=3
create a VM using this disk. Use ignition to wipe disk (wipe_table: true) and create a partition:

storage:
disks:
  - device: /dev/vdb
    wipe_table: true
    partitions:
      - label: data
        number: 1
        size_mib: 1024
        start_mib: 0

Other Information

Seem to be due to sgdisk returning an error when the disk partition table is corrupted. In this case ignition seems to returns an error on commit fail: https://github.com/coreos/ignition/blob/14151500635e8a0b020897fb2fd68706d8056fbb/internal/exec/stages/disks/partitions.go#L327
Should we ignore/log this error instead of returning?

The text was updated successfully, but these errors were encountered:

travier · 2023-10-19T11:03:50Z

Initially filed in Ignition but moving to the FCOS tracker as this is about FCOS.

jlebon · 2023-10-19T16:00:54Z

Hmm, that seems like a behavioural bug in sgdisk. That said, it looks like wipefs -at gpt also works, so we could consider switching to that. I'd rather we try to get sgdisk fixed though.

edcdavid · 2023-10-20T20:31:23Z

After looking in sgdisk repo, it looks like sgdisk is always returning exit code 2 when the partition table cannot be read, see https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gptcl.cc#l490. So the call to wipe partitions is called in two different places:

when partitions are read successfully and returning 0, see https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gptcl.cc#l431
when partitions failed to be read and returning 2, see https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gptcl.cc#l473

imho, sgdisk should only fail if the function wiping the GPT/MBR is returning an error, right now it is only printing to stderr, for example here: https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gpt.cc#l1529

edcdavid · 2023-10-24T21:53:15Z

The issue seems to have been introduced by this PR to fix a linter error: coreos/ignition#1149
The previous behavior was to assume disk wipe is always successful. If actually not successful, the error would be caught by the next sgdisk calls to create partitions, which also calls the same function to discover partitions.
Why not just revert to the old behavior and whitelist the linter warning?
I created the following PR for sgdisk but I am afraid this new logic will break other tools: https://sourceforge.net/p/gptfdisk/code/merge-requests/35/

When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table was corrupted in the first place, `sgdisk` will exit with code 2 which then breaks boot. As a workaround, Ignition used to retry the invocation but the context around it was lost in coreos#544 and coreos#1149 and the retry was removed and the error-checking was added. So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all operation on failure"), but now with a comment and a test to make sure we don't regress again. Closes: coreos/fedora-coreos-tracker#1596

jlebon · 2023-10-25T16:29:43Z

The issue seems to have been introduced by this PR to fix a linter error: coreos/ignition#1149 The previous behavior was to assume disk wipe is always successful. If actually not successful, the error would be caught by the next sgdisk calls to create partitions, which also calls the same function to discover partitions. Why not just revert to the old behavior and whitelist the linter warning?

Nice find! Doing some git investigations let me to coreos/ignition#544 (comment). Aha! We did use to retry this operation in the past, but it was lost during that PR because it wasn't clear why it was added in the first place (and tests still passed). Hmm, was it perhaps for a similar issue as here? Digging further reveals that yes, that's exactly it: coreos/ignition#162.

Restored this logic in coreos/ignition#1735, now with a comment and a test to make sure we don't regress on this again.

I created the following PR for sgdisk but I am afraid this new logic will break other tools: sourceforge.net/p/gptfdisk/code/merge-requests/35

Thanks! If this merges, we would in theory be able to eventually drop the retry logic in Ignition.

When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table was corrupted in the first place, `sgdisk` will exit with code 2 which then breaks boot. As a workaround, Ignition used to retry the invocation but the context around it was lost in coreos#544 and coreos#1149 and the retry was removed and the error-checking was added. So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all operation on failure"), but now with a comment and a test to make sure we don't regress again. Closes: coreos/fedora-coreos-tracker#1596

edcdavid · 2023-10-25T20:16:50Z

Great, sounds good! Thanks for the fix and test.

travier transferred this issue from coreos/ignition Oct 19, 2023

travier added the kind/bug label Oct 19, 2023

jlebon mentioned this issue Oct 25, 2023

stages/disks: retry sgdisk --zap-all invocation coreos/ignition#1735

Merged

jlebon closed this as completed in coreos/ignition#1735 Oct 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignition fails to wipe disk if partition table is corrupted #1596

Ignition fails to wipe disk if partition table is corrupted #1596

edcdavid commented Oct 18, 2023

travier commented Oct 19, 2023

jlebon commented Oct 19, 2023

edcdavid commented Oct 20, 2023 •

edited

Loading

edcdavid commented Oct 24, 2023 •

edited

Loading

jlebon commented Oct 25, 2023

edcdavid commented Oct 25, 2023

Ignition fails to wipe disk if partition table is corrupted #1596

Ignition fails to wipe disk if partition table is corrupted #1596

Comments

edcdavid commented Oct 18, 2023

Bug coreos/ignition#1728

Operating System Version

Ignition Version

Environment

Expected Behavior

Actual Behavior

Reproduction Steps

Other Information

travier commented Oct 19, 2023

jlebon commented Oct 19, 2023

edcdavid commented Oct 20, 2023 • edited Loading

edcdavid commented Oct 24, 2023 • edited Loading

jlebon commented Oct 25, 2023

edcdavid commented Oct 25, 2023

edcdavid commented Oct 20, 2023 •

edited

Loading

edcdavid commented Oct 24, 2023 •

edited

Loading