-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignition fails to wipe disk if partition table is corrupted #1596
Comments
Initially filed in Ignition but moving to the FCOS tracker as this is about FCOS. |
Hmm, that seems like a behavioural bug in |
After looking in sgdisk repo, it looks like sgdisk is always returning exit code 2 when the partition table cannot be read, see https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gptcl.cc#l490. So the call to wipe partitions is called in two different places:
imho, sgdisk should only fail if the function wiping the GPT/MBR is returning an error, right now it is only printing to stderr, for example here: https://sourceforge.net/p/gptfdisk/code/ci/master/tree/gpt.cc#l1529 |
The issue seems to have been introduced by this PR to fix a linter error: coreos/ignition#1149 |
When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table was corrupted in the first place, `sgdisk` will exit with code 2 which then breaks boot. As a workaround, Ignition used to retry the invocation but the context around it was lost in coreos#544 and coreos#1149 and the retry was removed and the error-checking was added. So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all operation on failure"), but now with a comment and a test to make sure we don't regress again. Closes: coreos/fedora-coreos-tracker#1596
Nice find! Doing some git investigations let me to coreos/ignition#544 (comment). Aha! We did use to retry this operation in the past, but it was lost during that PR because it wasn't clear why it was added in the first place (and tests still passed). Hmm, was it perhaps for a similar issue as here? Digging further reveals that yes, that's exactly it: coreos/ignition#162. Restored this logic in coreos/ignition#1735, now with a comment and a test to make sure we don't regress on this again.
Thanks! If this merges, we would in theory be able to eventually drop the retry logic in Ignition. |
When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table was corrupted in the first place, `sgdisk` will exit with code 2 which then breaks boot. As a workaround, Ignition used to retry the invocation but the context around it was lost in coreos#544 and coreos#1149 and the retry was removed and the error-checking was added. So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all operation on failure"), but now with a comment and a test to make sure we don't regress again. Closes: coreos/fedora-coreos-tracker#1596
When `wipeTable` is enabled, we run `sgdisk --zap-all`. But if the table was corrupted in the first place, `sgdisk` will exit with code 2 which then breaks boot. As a workaround, Ignition used to retry the invocation but the context around it was lost in coreos#544 and coreos#1149 and the retry was removed and the error-checking was added. So this patch effectively re-applies 94c98bc ("sgdisk: retry zap-all operation on failure"), but now with a comment and a test to make sure we don't regress again. Closes: coreos/fedora-coreos-tracker#1596
Great, sounds good! Thanks for the fix and test. |
Bug coreos/ignition#1728
Operating System Version
Fedora CoreOS 38 (38.20230722.3.0)
Ignition Version
Ignition 2.16.2
Environment
What hardware/cloud provider/hypervisor is being used to run Ignition?
CoreOS running on libvirt VM
Expected Behavior
When "wipe_table: true" is specified the disk device should be wiped even if its partition table is corrupted
Actual Behavior
Ignition fails to wipe disk with sgdisk error:
[ 35.521754] ignition[855]: disks: createPartitions: op(2): op(3): [failed] deleting 0 partitions and creating 0 partitions on "/run/ignition/dev_aliases/dev/vdb": exit status 2: Cmd: "sgdisk" "--zap-all" "/run/ignition/dev_aliases/dev/vdb" Stdout: "GPT data structures destroyed! You may now partition the disk using fdisk or\nother utilities.\n" Stderr: "\aCaution: invalid main GPT header, but valid backup; regenerating main header\nfrom backup!\n\n\aWarning: Invalid CRC on main header data; loaded backup partition table.\nWarning! Main and backup partition tables differ! Use the 'c' and 'e' options\non the recovery & transformation menu to examine the two tables.\n\n\aWarning! Main partition table CRC mismatch! Loaded backup partition table\ninstead of main partition table!\n\nWarning! One or more CRCs don't match. You should repair the disk!\nMain header: ERROR\nBackup header: OK\nMain partition table: ERROR\nBackup partition table: OK\n\nInvalid partition data!\n"
Reproduction Steps
sgdisk --new=2:0:0 -i=2 /dev/vdb
dd if=/dev/urandom of=/dev/vdb bs=512 count=3
Other Information
Seem to be due to sgdisk returning an error when the disk partition table is corrupted. In this case ignition seems to returns an error on commit fail: https://github.com/coreos/ignition/blob/14151500635e8a0b020897fb2fd68706d8056fbb/internal/exec/stages/disks/partitions.go#L327
Should we ignore/log this error instead of returning?
The text was updated successfully, but these errors were encountered: