I have a very small boot partition (1GiB) in 2-device raid1. Trying to use btrfs replace to replace a missing device did not succeed.
btrfs replace start 2 /dev/nvme0n1p2 /mnt/boot
Reported kernel messages:
info (device nvme1n1p2): dev_replace from <missing disk> (devid 2) to /dev/nvme0n1p2 started
warning (device nvme1n1p2): failed setting block group ro: -28
error (device nvme1n1p2): btrfs_scrub_dev(<missing disk>, 2, /dev/nvme0n1p2) failed -28
The output of btrfs replace itself was confusing. It started by saying it was doing fstrim and then emitted the warning and error, which made it appear as though the fstrim of the new device was related to the failure, because (1) it did not say that the fstrim completed successfully, and (2) the error number is not symbolicated, so all I knew was “-28” instead of “ENOSPC”. btrfs replace provided no additional feedback or guidance about what the problem was or what to do next.
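To illustrate what a symbolicated error would look like, here is a minimal sketch that maps a kernel-style negative return code to its errno name using Python's standard library. The helper name `symbolicate` is made up for this example and is not part of btrfs-progs.

```python
import errno
import os


def symbolicate(code: int) -> str:
    """Turn a kernel-style return code such as -28 into 'NAME (message)'.

    Hypothetical helper for illustration only; not a btrfs-progs function.
    """
    number = abs(code)  # kernel messages report errno values negated
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


# The code reported in the kernel messages above.
print(symbolicate(-28))
```

On Linux, errno 28 is ENOSPC, so the printed line starts with that name followed by the platform's message text.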
The next thing I tried was deleting a 250MiB file from the filesystem. That worked, leaving only about 100MiB of data on the filesystem, but btrfs replace still failed with exactly the same error.
Next I tried to balance down to single, which succeeded where btrfs replace did not:
info (device nvme1n1p2): balance: start -f -dconvert=single -mconvert=single -sconvert=single […]
info (device nvme1n1p2): relocating block […]
[…] more normal relocation messages […]
info (device nvme1n1p2): balance: ended with status: 0
Then I deleted the missing device so I could add the new one, and that also seemed to work:
info (device nvme1n1p2): device deleted: missing
Then I tried to add the new device, and this is where things started going even more wrong (sorry about the abbreviated trace with typos, I am transcribing from a photo taken on a phone, and it was late and I did not even manage to get the whole thing into the camera frame 🥴):
After this point, trying to run most btrfs tools would segfault in a similar way. Only a few btrfs-progs functions still worked: btrfs filesystem show ran and showed that the second device did seem to be part of the filesystem as expected, using 0 bytes as expected, and btrfs device stats showed two devices with 0 errors on both, but anything else, like btrfs filesystem usage, would crash.
At this point it was impossible to do much of anything. sync (which I ran after copying everything off the boot partition to non-volatile storage in case I had to blow away the filesystem) hung until it eventually responded to a kill. umount hung forever and would not respond to a kill. Rebooting also hung, and the system had to be forcibly powered off. Once rebooted, the second device still showed as attached to the filesystem and everything was fine: I was able to successfully rebalance to raid1, and everything seems OK.
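For anyone trying to follow the same path, the sequence that eventually worked (balance to single, remove the missing device, add the new device, balance back to raid1) can be sketched as a dry-run plan. This is a reconstruction, not a transcript: only the balance-to-single arguments appear in the kernel log above, and the other three command lines are assumptions based on the standard btrfs-progs subcommands. The function name `recovery_plan` is made up for this example, and nothing here is executed. Note that on this kernel the device add step is where the crash happened.

```python
import shlex


def recovery_plan(new_device: str, mount_point: str) -> list[list[str]]:
    """Return the workaround as a list of argv lists, without running anything.

    Hypothetical helper; the exact commands originally run are assumed,
    apart from the balance-to-single arguments taken from the kernel log.
    """
    return [
        # 1. Convert data, metadata and system chunks to single (from the log).
        ["btrfs", "balance", "start", "-f",
         "-dconvert=single", "-mconvert=single", "-sconvert=single",
         mount_point],
        # 2. Drop the missing device (assumed command line).
        ["btrfs", "device", "remove", "missing", mount_point],
        # 3. Add the replacement device (assumed; this step crashed here).
        ["btrfs", "device", "add", new_device, mount_point],
        # 4. Convert back to raid1 (assumed command line).
        ["btrfs", "balance", "start",
         "-dconvert=raid1", "-mconvert=raid1", mount_point],
    ]


plan = recovery_plan("/dev/nvme0n1p2", "/mnt/boot")
for argv in plan:
    print(shlex.join(argv))  # print only; review before running by hand
```

Printing the plan instead of running it keeps each step reviewable, which matters here because a reboot was needed between the device add and the final balance.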
This seems like a problem on the kernel side, at least the sysfs issues, and it may actually be multiple issues, but it started with btrfs replace, so I am reporting it here.
For the record the current usage state of the filesystem after rebalance to single, dev remove, dev add, rebalance to raid1, looks like this:
Kernel 6.9.9
btrfs-progs 6.6.3