Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignition fails to wipe disk if partition table is corrupted #1596

Closed
edcdavid opened this issue Oct 18, 2023 · 6 comments · Fixed by coreos/ignition#1735
Closed

Ignition fails to wipe disk if partition table is corrupted #1596

edcdavid opened this issue Oct 18, 2023 · 6 comments · Fixed by coreos/ignition#1735
Labels

Comments

@edcdavid
Copy link

Bug coreos/ignition#1728

Operating System Version

Fedora CoreOS 38 (38.20230722.3.0)

Ignition Version

Ignition 2.16.2

Environment

What hardware/cloud provider/hypervisor is being used to run Ignition?
CoreOS running on libvirt VM

Expected Behavior

When "wipe_table: true" is specified the disk device should be wiped even if its partition table is corrupted

Actual Behavior

Ignition fails to wipe disk with sgdisk error:
[ 35.521754] ignition[855]: disks: createPartitions: op(2): op(3): [failed] deleting 0 partitions and creating 0 partitions on "/run/ignition/dev_aliases/dev/vdb": exit status 2: Cmd: "sgdisk" "--zap-all" "/run/ignition/dev_aliases/dev/vdb" Stdout: "GPT data structures destroyed! You may now partition the disk using fdisk or\nother utilities.\n" Stderr: "\aCaution: invalid main GPT header, but valid backup; regenerating main header\nfrom backup!\n\n\aWarning: Invalid CRC on main header data; loaded backup partition table.\nWarning! Main and backup partition tables differ! Use the 'c' and 'e' options\non the recovery & transformation menu to examine the two tables.\n\n\aWarning! Main partition table CRC mismatch! Loaded backup partition table\ninstead of main partition table!\n\nWarning! One or more CRCs don't match. You should repair the disk!\nMain header: ERROR\nBackup header: OK\nMain partition table: ERROR\nBackup partition table: OK\n\nInvalid partition data!\n"

Reproduction Steps

  1. Create a disk with corrupted partition table, for instance:
    sgdisk --new=2:0:0 -i=2 /dev/vdb
    dd if=/dev/urandom of=/dev/vdb bs=512 count=3
  2. create a VM using this disk. Use ignition to wipe disk (wipe_table: true) and create a partition:
storage:
disks:
  - device: /dev/vdb
    wipe_table: true
    partitions:
      - label: data
        number: 1
        size_mib: 1024
        start_mib: 0

Other Information

Seem to be due to sgdisk returning an error when the disk partition table is corrupted. In this case ignition seems to returns an error on commit fail: https://github.com/coreos/ignition/blob/14151500635e8a0b020897fb2fd68706d8056fbb/internal/exec/stages/disks/partitions.go#L327
Should we ignore/log this error instead of returning?

@travier travier transferred this issue from coreos/ignition Oct 19, 2023
@travier
Copy link
Member

travier commented Oct 19, 2023

Initially filed in Ignition but moving to the FCOS tracker as this is about FCOS.

@jlebon
Copy link
Member

jlebon commented Oct 19, 2023

Hmm, that seems like a behavioural bug in sgdisk. That said, it looks like wipefs -at gpt also works, so we could consider switching to that. I'd rather we try to get sgdisk fixed though.

@edcdavid
Copy link
Author

edcdavid commented Oct 20, 2023

After looking in sgdisk repo, it looks like sgdisk is always returning exit code 2 when the partition table cannot be read, see https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gptcl.cc#l490. So the call to wipe partitions is called in two different places:

imho, sgdisk should only fail if the function wiping the GPT/MBR is returning an error, right now it is only printing to stderr, for example here: https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gpt.cc#l1529

@edcdavid
Copy link
Author

edcdavid commented Oct 24, 2023

The issue seems to have been introduced by this PR to fix a linter error: coreos/ignition#1149
The previous behavior was to assume disk wipe is always successful. If actually not successful, the error would be caught by the next sgdisk calls to create partitions, which also calls the same function to discover partitions.
Why not just revert to the old behavior and whitelist the linter warning?
I created the following PR for sgdisk but I am afraid this new logic will break other tools: https://sourceforge.net/p/gptfdisk/code/merge-requests/35/

jlebon added a commit to jlebon/ignition that referenced this issue Oct 25, 2023
When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table
was corrupted in the first place, `sgdisk` will exit with code 2 which
then breaks boot.

As a workaround, Ignition used to retry the invocation but the context
around it was lost in coreos#544 and coreos#1149 and the retry was removed and
the error-checking was added.

So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all
operation on failure"), but now with a comment and a test to make sure
we don't regress again.

Closes: coreos/fedora-coreos-tracker#1596
@jlebon
Copy link
Member

jlebon commented Oct 25, 2023

The issue seems to have been introduced by this PR to fix a linter error: coreos/ignition#1149 The previous behavior was to assume disk wipe is always successful. If actually not successful, the error would be caught by the next sgdisk calls to create partitions, which also calls the same function to discover partitions. Why not just revert to the old behavior and whitelist the linter warning?

Nice find! Doing some git investigations let me to coreos/ignition#544 (comment). Aha! We did use to retry this operation in the past, but it was lost during that PR because it wasn't clear why it was added in the first place (and tests still passed). Hmm, was it perhaps for a similar issue as here? Digging further reveals that yes, that's exactly it: coreos/ignition#162.

Restored this logic in coreos/ignition#1735, now with a comment and a test to make sure we don't regress on this again.

I created the following PR for sgdisk but I am afraid this new logic will break other tools: sourceforge.net/p/gptfdisk/code/merge-requests/35

Thanks! If this merges, we would in theory be able to eventually drop the retry logic in Ignition.

jlebon added a commit to jlebon/ignition that referenced this issue Oct 25, 2023
When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table
was corrupted in the first place, `sgdisk` will exit with code 2 which
then breaks boot.

As a workaround, Ignition used to retry the invocation but the context
around it was lost in coreos#544 and coreos#1149 and the retry was removed and
the error-checking was added.

So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all
operation on failure"), but now with a comment and a test to make sure
we don't regress again.

Closes: coreos/fedora-coreos-tracker#1596
jlebon added a commit to jlebon/ignition that referenced this issue Oct 25, 2023
When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table
was corrupted in the first place, `sgdisk` will exit with code 2 which
then breaks boot.

As a workaround, Ignition used to retry the invocation but the context
around it was lost in coreos#544 and coreos#1149 and the retry was removed and
the error-checking was added.

So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all
operation on failure"), but now with a comment and a test to make sure
we don't regress again.

Closes: coreos/fedora-coreos-tracker#1596
@edcdavid
Copy link
Author

Great, sounds good! Thanks for the fix and test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants