`btrfs replace` confusing messages on ENOSPC, broken sysfs, segfaults, umount hang, had to force power off #870

csnover · 2024-08-13T22:38:12Z

I have a very small boot partition (1GiB) in 2-device raid1. Trying to use btrfs replace to replace a missing device did not succeed.

btrfs replace start 2 /dev/nvme0n1p2 /mnt/boot

Reported kernel messages:

info (device nvme1n1p2): dev_replace from <missing disk> (devid 2) to /dev/nvme0n1p2 started
warning (device nvme1n1p2): failed setting block group ro: -28
error (device nvme1n1p2): btrfs_scrub_dev(<missing disk>, 2, /dev/nvme0n1p2) failed -28

The output of btrfs replace itself was confusing. It started by saying it was doing fstrim and then emitted the warning and error, so the sequence made it appear as though the fstrim of the new device was related to the failure: (1) it did not say that the fstrim had completed successfully, and (2) because the error number is not symbolicated, all I knew was “-28” instead of “ENOSPC”. btrfs replace provided no additional feedback or guidance about what the problem was or what to do next.
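To illustrate the symbolication point, here is a minimal sketch of how a trailing negative errno in a kernel message could be annotated with its symbolic name. The helper name `annotate_errno` is hypothetical and is not part of btrfs-progs; the sketch assumes it runs on Linux, where Python's `errno` table matches the kernel's numbering.

```python
import errno
import os
import re

# Matches a negative errno at the end of a line, e.g. "failed -28" or "ro: -28".
_TRAILING_ERRNO = re.compile(r"-(\d+)\s*$")


def annotate_errno(line: str) -> str:
    """Append the symbolic name and description of a trailing -errno, if any."""
    match = _TRAILING_ERRNO.search(line)
    if match is None:
        return line  # no trailing negative number, leave the line untouched
    code = int(match.group(1))
    name = errno.errorcode.get(code)
    if name is None:
        return line  # unknown errno on this platform
    return f"{line} ({name}: {os.strerror(code)})"


if __name__ == "__main__":
    print(annotate_errno("failed setting block group ro: -28"))
    print(annotate_errno("balance: ended with status: 0"))
```

With this kind of annotation, the two failing messages above would carry “ENOSPC” next to “-28”, while lines without a trailing negative number pass through unchanged.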

The next step I tried was to delete a 250MiB file from the filesystem. That worked, leaving only about 100MiB of data on the filesystem, but btrfs replace still failed with exactly the same error.

Next I tried to balance down to single, which succeeded where btrfs replace did not:

info (device nvme1n1p2): balance: start -f -dconvert=single -mconvert=single -sconvert=single […]
info (device nvme1n1p2): relocating block […]
[…] more normal relocation messages […]
info (device nvme1n1p2): balance: ended with status: 0

Then I deleted the missing device so I could add the new one, and that also seemed to work:

info (device nvme1n1p2): device deleted: missing

Then I tried to add the new device, and this is where things started going even more wrong (sorry about the abbreviated trace with typos, I am transcribing from a photo taken on a phone, and it was late and I did not even manage to get the whole thing into the camera frame 🥴):

sysfs: cannot create duplicate filename '/fs/btrfs/<uuid>/devinfo/2'
CPU: 11 PID: 3860 Comm: btrfs Not tainted 6.9.9-amd64 #1  Debian 6.9.9-1
Hardware name: […]
Call Trace:
<TASK>
dump_stack_lvl
sysfs_warn_dup
sysfs_create_dir_ns
kobject_add_internal
kobject_init_and_add
? srso_alias_return_thunk
btrfs_sysfs_add_device
btrfs_init_new_device
? __kmalloc_node_track_caller
? btrfs_ioctl
? srso_return_thunk
? __check_object_size
btrfs_ioctl
? srso_alias_return_thunk
? filename_lookup
__x64_sys_ioctl
do_syscall_64
? getname_flags.part.0
? srso_alias_return_thunk
? mntput_no_expire
[…]
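The warning implies that a `devinfo/2` entry was still present in sysfs when the new device was added. As a rough sketch, the devids currently exposed under a filesystem's sysfs directory could be listed like this. The function name `list_devinfo_ids` is hypothetical, and the caller is assumed to pass the `/sys/fs/btrfs/<uuid>` directory of the filesystem in question; the sketch only reads directory names and does not open any attribute files.

```python
import os
from typing import List


def list_devinfo_ids(fs_sysfs_dir: str) -> List[int]:
    """Return the numeric devids found under <fs_sysfs_dir>/devinfo, sorted."""
    devinfo_dir = os.path.join(fs_sysfs_dir, "devinfo")
    try:
        names = os.listdir(devinfo_dir)
    except FileNotFoundError:
        return []  # no devinfo directory, e.g. wrong path or older kernel
    # Only purely numeric names are devids; ignore anything else.
    return sorted(int(name) for name in names if name.isdigit())
```

If devid 2 still appears in the result after the missing device has been deleted, that would be consistent with the duplicate-filename warning above, though this is only a diagnostic aid and not a confirmed root cause.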

After this point trying to run most btrfs tools would segfault in a similar way:

RIP: 0010:sysfs_kf_seq_show
[…]
<TASK>
? die
? do_trap
? do_error_trap
? exc_stack_segment
? asm_exc_stack_segment
? sysfs_kf_seq_show
seq_read_iter
? security_file_permission
vfs_read
ksys_read
do_syscall_64
? path_openat
? blkdev_ioctl
? do_filp_open
? do_sys_openat2
? syscall_exit_to_user_mode
? do_syscall_64
? entry_SYSCALL_64_after_hwframe
[…]

At this point only a few btrfs-progs functions would work. I could run btrfs filesystem show, which showed that the second device was part of the filesystem and using 0 bytes, as expected, and btrfs device stats showed two devices with 0 errors on both. Anything else, such as btrfs filesystem usage, would crash.

At this point it was impossible to do much of anything. sync (which I ran after copying everything off the boot partition to non-volatile storage in case I had to blow away the filesystem) hung until it eventually responded to a kill. umount hung and would not respond to a kill. Rebooting the system also hung, so it had to be forcibly powered off. Once rebooted, the second device still showed as attached to the filesystem, I was able to rebalance to raid1 successfully, and everything seems OK.
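For reference, the sequence of steps described in this report can be written out as a dry-run plan. This is only a sketch: the function name `workaround_plan` is hypothetical, the first command mirrors the balance arguments shown in the kernel log, and the exact arguments of the final rebalance to raid1 are an assumption because the report does not quote them. Nothing is executed; the function only returns argument lists.

```python
import shlex
from typing import List


def workaround_plan(mount_point: str, new_device: str) -> List[List[str]]:
    """Return the commands from the report, in order, without running them."""
    return [
        # 1. Convert everything to single so the missing device can be dropped.
        ["btrfs", "balance", "start", "-f",
         "-dconvert=single", "-mconvert=single", "-sconvert=single", mount_point],
        # 2. Remove the missing device from the filesystem.
        ["btrfs", "device", "delete", "missing", mount_point],
        # 3. Add the replacement device (the step that hit the sysfs warning).
        ["btrfs", "device", "add", new_device, mount_point],
        # 4. Convert back to raid1 (arguments assumed, not quoted in the report).
        ["btrfs", "balance", "start",
         "-dconvert=raid1", "-mconvert=raid1", mount_point],
    ]


if __name__ == "__main__":
    for argv in workaround_plan("/mnt/boot", "/dev/nvme0n1p2"):
        print(shlex.join(argv))
```

Note that in this report step 3 is where the sysfs warning appeared and the rebalance in step 4 only happened after a forced power cycle, so the plan documents what was done rather than a recommended procedure.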

This is probably a kernel-side problem, at least for the sysfs issues, and may actually be multiple issues, but it started with btrfs replace, so I am reporting it here.

For the record, the current usage state of the filesystem after rebalance to single, dev remove, dev add, and rebalance to raid1 looks like this:

Overall:
    Device size:             2.00GiB
    Device allocated:        1.87GiB
    Device unallocated:      130.00MiB
    Device missing:          0.00B
    Device slack:            0.00B
    Used:                    622.49MiB
    Free (estimated):        489.64MiB  (min: 489.64MiB)
    Free (statfs, df):       488.64MiB
    Data ratio:              2.00
    Metadata ratio:          2.00
    Global reserve:          5.50MiB  (used: 0.00B)
    Multiple profiles:       no

Data,RAID1: Size:735.00MiB, Used:310.36MiB (42.23%)
   /dev/nvme1n1p2    735.00MiB
   /dev/nvme0n1p2    735.00MiB

Metadata,RAID1: Size:192.00MiB, Used:896.00KiB (0.46%)
   /dev/nvme1n1p2    192.00MiB
   /dev/nvme0n1p2    192.00MiB

System,RAID1: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/nvme1n1p2     32.00MiB
   /dev/nvme0n1p2     32.00MiB

Unallocated:
   /dev/nvme1n1p2     65.00MiB
   /dev/nvme0n1p2     65.00MiB

Kernel 6.9.9
btrfs-progs 6.6.3
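The usage figures above can be cross-checked with a little arithmetic. The sketch below assumes two copies of every chunk (RAID1) and assumes that the estimated free space is the free space inside the data chunks plus the unallocated space divided by the number of copies; that formula is an assumption that happens to reproduce the reported numbers, not a statement of how btrfs-progs computes them. The function name `usage_summary` is hypothetical.

```python
from typing import Dict


def usage_summary(device_size_mib: float,
                  chunk_sizes_mib: Dict[str, float],
                  data_used_mib: float,
                  copies: int = 2) -> Dict[str, float]:
    """Recompute allocated, unallocated and estimated free space in MiB."""
    # Every chunk is stored `copies` times across the devices.
    allocated = copies * sum(chunk_sizes_mib.values())
    unallocated = device_size_mib - allocated
    # Free space inside the existing data chunks (logical, not raw).
    data_free = chunk_sizes_mib["Data"] - data_used_mib
    # Assumed estimate: unallocated raw space yields 1/copies usable space.
    free_estimated = data_free + unallocated / copies
    return {
        "allocated": allocated,
        "unallocated": unallocated,
        "free_estimated": free_estimated,
    }


if __name__ == "__main__":
    summary = usage_summary(
        device_size_mib=2.00 * 1024,
        chunk_sizes_mib={"Data": 735.0, "Metadata": 192.0, "System": 32.0},
        data_used_mib=310.36,
    )
    for key, value in summary.items():
        print(f"{key}: {value:.2f} MiB")
```

With the values from the report, this gives 1918 MiB allocated (about 1.87GiB), 130 MiB unallocated in total (65 MiB per device), and roughly 489.64 MiB estimated free, which shows how little unallocated space remains on a filesystem this small.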


kdave added the "bug" and "kernel" (something in kernel has to be done too) labels on Aug 13, 2024
