Swap deadlock in 0.7.9 #7734

runderwo · 2018-07-21T16:35:07Z

System information

Type	Version/Name
Distribution Name	Debian
Distribution Version	stretch
Linux Kernel	4.16.0-0.bpo.2-amd64
Architecture	amd64
ZFS Version	0.7.9-3~bpo9+1
SPL Version	0.7.9-3~bpo9+1

Describe the problem you're observing

System deadlocked while forcing a page out to a swap zvol. Unfortunately I do not have the rest of the backtraces.

Describe how to reproduce the problem

rsync a filesystem with O(1M) files.

Include any warning/errors/backtraces from the system logs

NAME        PROPERTY               VALUE                  SOURCE
rpool/swap  type                   volume                 -
rpool/swap  creation               Tue Jun 26 17:52 2018  -
rpool/swap  used                   8.50G                  -
rpool/swap  available              8.84T                  -
rpool/swap  referenced             128K                   -
rpool/swap  compressratio          1.09x                  -
rpool/swap  reservation            none                   default
rpool/swap  volsize                8G                     local
rpool/swap  volblocksize           4K                     -
rpool/swap  checksum               on                     default
rpool/swap  compression            zle                    local
rpool/swap  readonly               off                    default
rpool/swap  createtxg              156102                 -
rpool/swap  copies                 1                      default
rpool/swap  refreservation         8.50G                  local
rpool/swap  guid                   11434631848701244988   -
rpool/swap  primarycache           metadata               local
rpool/swap  secondarycache         all                    default
rpool/swap  usedbysnapshots        0B                     -
rpool/swap  usedbydataset          128K                   -
rpool/swap  usedbychildren         0B                     -
rpool/swap  usedbyrefreservation   8.50G                  -
rpool/swap  logbias                throughput             local
rpool/swap  dedup                  off                    default
rpool/swap  mlslabel               none                   default
rpool/swap  sync                   always                 local
rpool/swap  refcompressratio       1.09x                  -
rpool/swap  written                128K                   -
rpool/swap  logicalused            46K                    -
rpool/swap  logicalreferenced      46K                    -
rpool/swap  volmode                default                default
rpool/swap  snapshot_limit         none                   default
rpool/swap  snapshot_count         none                   default
rpool/swap  snapdev                hidden                 default
rpool/swap  context                none                   default
rpool/swap  fscontext              none                   default
rpool/swap  defcontext             none                   default
rpool/swap  rootcontext            none                   default
rpool/swap  redundant_metadata     all                    default
rpool/swap  com.sun:auto-snapshot  false                  local


shartse · 2018-08-28T19:11:33Z

@jgallag88 and I have been observing this as well. Here's our bug report with a simple reproducer:

System information

Type	Version/Name
Distribution Name	Ubuntu
Distribution Version	18.04
Linux Kernel	4.15.0-33-generic
Architecture	amd64
ZFS Version	Master
SPL Version	Master

Describe the problem you're observing

With zvol swap devices, memory-intensive operations hang indefinitely even though plenty of swap space is available. The hang appears to occur in ZIL code while ZFS is allocating memory for a zio as part of its work to swap out pages.

We also tested this situation with an ext4-configured swap device and did not see any hangs.

This is the memory usage on one of our typical VMs:

              total        used        free      shared  buff/cache   available
Mem:           7.3G        3.6G        3.1G         32M        603M        3.4G
Swap:          4.0G          0B        4.0G

Reproducing the problem

Configure a `zvol` as a swap device

zfs create -V 4G -b "$(getconf PAGESIZE)" -o logbias=throughput -o sync=always -o primarycache=metadata rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
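
Before generating memory pressure, it can help to confirm that the zvol really is the active swap device. The following is a minimal sketch, not part of the original report: `swap_is_active` is a hypothetical helper name, and it assumes the kernel lists the resolved device node (for example `/dev/zd0`) in `/proc/swaps` rather than the `/dev/zvol/...` symlink.

```sh
# swap_is_active DEVICE [SWAPS_FILE]
# Succeeds if DEVICE, after resolving symlinks, is listed in SWAPS_FILE
# (default: /proc/swaps). Hypothetical helper for illustration only.
swap_is_active() {
    dev=$(readlink -f "$1") || return 2
    swaps=${2:-/proc/swaps}
    # Skip the header line, then compare the first column exactly.
    awk -v d="$dev" 'NR > 1 && $1 == d { found = 1 } END { exit !found }' "$swaps"
}
```

For example, `swap_is_active /dev/zvol/rpool/swap && echo "swap zvol is active"` should print the message once the `swapon` above has succeeded.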

Run a memory-intensive operation

Run something that will use up all your memory, for example stress:
stress --vm 10 --vm-keep --vm-bytes 512M
Check that all memory is being used with
watch free -h
Eventually the system will start to swap, but it then quickly hangs and does not recover.
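
Because the machine stops responding, the last `watch free -h` screen is easy to lose. As a rough alternative, the sketch below prints one timestamped line per interval, so the final samples before the hang stay in the terminal scrollback. The function names `mem_sample` and `watch_mem` are made up for this example; the field names are the standard ones from `/proc/meminfo`.

```sh
# mem_sample [MEMINFO_FILE]
# Prints "MemAvailable SwapTotal SwapFree" (all in kB) from MEMINFO_FILE,
# which defaults to /proc/meminfo. Hypothetical helper for illustration.
mem_sample() {
    awk '$1 == "MemAvailable:" { avail = $2 }
         $1 == "SwapTotal:"    { swtotal = $2 }
         $1 == "SwapFree:"     { swfree = $2 }
         END { print avail, swtotal, swfree }' "${1:-/proc/meminfo}"
}

# watch_mem [INTERVAL_SECONDS]
# Prints one timestamped sample per interval until interrupted.
watch_mem() {
    while :; do
        echo "$(date +%T) $(mem_sample)"
        sleep "${1:-1}"
    done
}
```

Running `watch_mem 1` over SSH or a serial console from another machine is probably the most useful setup, since the output already received remains visible after the target hangs.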

Examples

Example of the thread trying to swap out a page:

PID: 1      TASK: ffff8cadf6208000  CPU: 1   COMMAND: "systemd"
 #0 [ffffb10840c671f0] __schedule at ffffffffb0594571
 #1 [ffffb10840c67288] schedule at ffffffffb0594bac
 #2 [ffffb10840c67298] schedule_preempt_disabled at ffffffffb0594e8e
 #3 [ffffb10840c672a8] __mutex_lock at ffffffffb059654c
 #4 [ffffb10840c67338] __mutex_lock_slowpath at ffffffffb05968a3
 #5 [ffffb10840c67350] mutex_lock at ffffffffb05968df
 #6 [ffffb10840c67368] zil_commit_writer at ffffffffc0723d21 [zfs]
 #7 [ffffb10840c673a0] zil_commit_impl at ffffffffc0723e66 [zfs]
 #8 [ffffb10840c673c0] zil_commit at ffffffffc0723ef0 [zfs]
 #9 [ffffb10840c673e8] zvol_write at ffffffffc0746e20 [zfs]
#10 [ffffb10840c67488] zvol_request at ffffffffc0747fba [zfs]
#11 [ffffb10840c674d8] generic_make_request at ffffffffb00534e4
#12 [ffffb10840c67538] submit_bio at ffffffffb0053733
#13 [ffffb10840c67588] __swap_writepage at ffffffffafe26e83
#14 [ffffb10840c67620] swap_writepage at ffffffffafe26f44
#15 [ffffb10840c67648] pageout at ffffffffafde683b
#16 [ffffb10840c676d8] shrink_page_list at ffffffffafde9caa
#17 [ffffb10840c677a0] shrink_inactive_list at ffffffffafdea6f2
#18 [ffffb10840c67858] shrink_node_memcg at ffffffffafdeb1a9
#19 [ffffb10840c67958] shrink_node at ffffffffafdeb6b7
#20 [ffffb10840c679e8] do_try_to_free_pages at ffffffffafdeb989
#21 [ffffb10840c67a40] try_to_free_pages at ffffffffafdebcde
#22 [ffffb10840c67ac8] __alloc_pages_slowpath at ffffffffafdd958e
#23 [ffffb10840c67bd8] __alloc_pages_nodemask at ffffffffafdda203
#24 [ffffb10840c67c40] alloc_pages_current at ffffffffafe385da
#25 [ffffb10840c67c70] __page_cache_alloc at ffffffffafdcc131
#26 [ffffb10840c67c90] filemap_fault at ffffffffafdcf978
#27 [ffffb10840c67d48] __do_fault at ffffffffafe0a864
#28 [ffffb10840c67d70] handle_pte_fault at ffffffffafe0f9e3
#29 [ffffb10840c67dc8] __handle_mm_fault at ffffffffafe10598
#30 [ffffb10840c67e70] handle_mm_fault at ffffffffafe10791
#31 [ffffb10840c67ea8] __do_page_fault at ffffffffafc74990
#32 [ffffb10840c67f20] do_page_fault at ffffffffafc74c3e
#33 [ffffb10840c67f50] page_fault at ffffffffb06015e5
    RIP: 00007fcfe704d69a  RSP: 00007ffe9178ead0  RFLAGS: 00010202
    RAX: 0000000000000001  RBX: 000055ca1d2bd8c0  RCX: 00007fcfe753abb7
    RDX: 0000000000000056  RSI: 00007ffe9178ead0  RDI: 0000000000000000
    RBP: 00007ffe9178efe0   R8: 0000000000000000   R9: 61742e6369736162
    R10: 00000000ffffffff  R11: 0000000000000000  R12: 0000000000000001
    R13: ffffffffffffffff  R14: 00007ffe9178ead0  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

Two places where zios were stuck

PID: 570    TASK: ffff91c660c32d80  CPU: 0   COMMAND: "z_wr_int"
 #0 [fffffe0000008cc0] machine_kexec at ffffffff896631c3
 #1 [fffffe0000008d20] __crash_kexec at ffffffff8972a479
 #2 [fffffe0000008de8] panic at ffffffff8968c7b0
 #3 [fffffe0000008e70] nmi_panic at ffffffff8968c329
 #4 [fffffe0000008e80] unknown_nmi_error at ffffffff896318e7
 #5 [fffffe0000008ea0] default_do_nmi at ffffffff89631a8e
 #6 [fffffe0000008ec8] do_nmi at ffffffff89631bc9
 #7 [fffffe0000008ef0] end_repeat_nmi at ffffffff8a0019eb
    [exception RIP: putback_inactive_pages+501]
    RIP: ffffffff897ea245  RSP: ffff9fc501e9f658  RFLAGS: 00000003
    RAX: ffff91c677012800  RBX: ffffeb4ac7e52a80  RCX: 0000000000000001
    RDX: ffffeb4ac7380720  RSI: 0000000000000000  RDI: ffff91c677012800
    RBP: ffff9fc501e9f6d8   R8: 0000000000027f20   R9: 0000000000036164
    R10: ffff91c67ffd2000  R11: 00000000000000e5  R12: 0000000000000000
    R13: ffffeb4ac7e52aa0  R14: ffff91c677012800  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #8 [ffff9fc501e9f658] putback_inactive_pages at ffffffff897ea245
 #9 [ffff9fc501e9f6e0] shrink_inactive_list at ffffffff897ea74d
#10 [ffff9fc501e9f798] shrink_node_memcg at ffffffff897eb1a9
#11 [ffff9fc501e9f898] shrink_node at ffffffff897eb6b7
#12 [ffff9fc501e9f930] do_try_to_free_pages at ffffffff897eb989
#13 [ffff9fc501e9f988] try_to_free_pages at ffffffff897ebcde
#14 [ffff9fc501e9fa10] __alloc_pages_slowpath at ffffffff897d958e
#15 [ffff9fc501e9fb20] __alloc_pages_nodemask at ffffffff897da203
#16 [ffff9fc501e9fb88] __alloc_pages at ffffffffc07851db [zfs]
#17 [ffff9fc501e9fb98] __alloc_pages_node at ffffffffc07851f1 [zfs]
#18 [ffff9fc501e9fba8] alloc_pages_node at ffffffffc0785225 [zfs]
#19 [ffff9fc501e9fbc0] abd_alloc_chunk at ffffffffc078523e [zfs]
#20 [ffff9fc501e9fbd0] abd_alloc_pages.constprop.8 at ffffffffc0785e85 [zfs]
#21 [ffff9fc501e9fc58] abd_alloc at ffffffffc07862f5 [zfs]
#22 [ffff9fc501e9fc80] abd_alloc_for_io at ffffffffc078647e [zfs]
#23 [ffff9fc501e9fc90] vdev_queue_aggregate at ffffffffc086187a [zfs]
#24 [ffff9fc501e9fd00] vdev_queue_io_to_issue at ffffffffc0861e11 [zfs]
#25 [ffff9fc501e9fd48] vdev_queue_io_done at ffffffffc0862582 [zfs]
#26 [ffff9fc501e9fd88] zio_vdev_io_done at ffffffffc08c61cf [zfs]
#27 [ffff9fc501e9fdb0] zio_execute at ffffffffc08c6c2c [zfs]
#28 [ffff9fc501e9fdf8] taskq_thread at ffffffffc065254d [spl]
#29 [ffff9fc501e9ff08] kthread at ffffffff896ae531
#30 [ffff9fc501e9ff50] ret_from_fork at ffffffff8a000205

And the next place:

PID: 567    TASK: ffff97afe0e05b00  CPU: 0   COMMAND: "z_wr_iss"
 #0 [fffffe0000008cc0] machine_kexec at ffffffffa64631c3
 #1 [fffffe0000008d20] __crash_kexec at ffffffffa652a479
 #2 [fffffe0000008de8] panic at ffffffffa648c7b0
 #3 [fffffe0000008e70] nmi_panic at ffffffffa648c329
 #4 [fffffe0000008e80] unknown_nmi_error at ffffffffa64318e7
 #5 [fffffe0000008ea0] default_do_nmi at ffffffffa6431a8e
 #6 [fffffe0000008ec8] do_nmi at ffffffffa6431bc9
 #7 [fffffe0000008ef0] end_repeat_nmi at ffffffffa6e019eb
    [exception RIP: page_referenced_one]
    RIP: ffffffffa661e5f0  RSP: ffffa65681e5f3a8  RFLAGS: 00000246
    RAX: ffffffffa661e5f0  RBX: 00000000ffc49000  RCX: ffffa65681e5f418
    RDX: 00000000ffc49000  RSI: ffff97afc231cea0  RDI: ffffdde407197dc0
    RBP: ffffa65681e5f3f8   R8: 0000000000080000   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000004  R12: ffffdde407197dc0
    R13: ffff97afc231cea0  R14: ffffa65681e5f430  R15: ffff97afc104dd80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #8 [ffffa65681e5f3a8] page_referenced_one at ffffffffa661e5f0
 #9 [ffffa65681e5f3a8] rmap_walk_anon at ffffffffa661e47a
#10 [ffffa65681e5f400] rmap_walk at ffffffffa66203c8
#11 [ffffa65681e5f410] page_referenced at ffffffffa66204bc
#12 [ffffa65681e5f488] shrink_page_list at ffffffffa65e9960
#13 [ffffa65681e5f550] shrink_inactive_list at ffffffffa65ea6f2
#14 [ffffa65681e5f608] shrink_node_memcg at ffffffffa65eb1a9
#15 [ffffa65681e5f708] shrink_node at ffffffffa65eb6b7
#16 [ffffa65681e5f7a0] do_try_to_free_pages at ffffffffa65eb989
#17 [ffffa65681e5f7f8] try_to_free_pages at ffffffffa65ebcde
#18 [ffffa65681e5f880] __alloc_pages_slowpath at ffffffffa65d958e
#19 [ffffa65681e5f990] __alloc_pages_nodemask at ffffffffa65da203
#20 [ffffa65681e5f9f8] alloc_pages_current at ffffffffa66385da
#21 [ffffa65681e5fa28] new_slab at ffffffffa6645884
#22 [ffffa65681e5fa98] ___slab_alloc at ffffffffa66463b7
#23 [ffffa65681e5fb58] __slab_alloc at ffffffffa6646530
#24 [ffffa65681e5fb80] kmem_cache_alloc at ffffffffa66466cb
#25 [ffffa65681e5fbc0] spl_kmem_cache_alloc at ffffffffc06d6a3b [spl]
#26 [ffffa65681e5fbf8] zio_create at ffffffffc0960601 [zfs]
#27 [ffffa65681e5fc50] zio_vdev_child_io at ffffffffc0961fcc [zfs]
#28 [ffffa65681e5fcf8] vdev_mirror_io_start at ffffffffc08f6b4e [zfs]
#29 [ffffa65681e5fd70] zio_vdev_io_start at ffffffffc0966ac7 [zfs]
#30 [ffffa65681e5fdb0] zio_execute at ffffffffc095cc2c [zfs]
#31 [ffffa65681e5fdf8] taskq_thread at ffffffffc06db54d [spl]
#32 [ffffa65681e5ff08] kthread at ffffffffa64ae531
#33 [ffffa65681e5ff50] ret_from_fork at ffffffffa6e00205

siv0 · 2018-09-04T14:03:57Z

Can reproduce this as well on:

System information

Type	Version/Name
Distribution Name	Proxmox VE (based on Debian)
Distribution Version	5.2 / stretch
Linux Kernel	4.15.18-4-pve (based on ubuntu-bionic)
Architecture	amd64
ZFS Version	0.7.9-pve3~bpo9
SPL Version	0.7.9-1

The swap zvol was created as above, with "-o compression=zle -o secondarycache=none" as additional parameters.
The system is a small QEMU machine; using a QEMU disk without ZFS as swap does not lock up the machine.

I could reproduce the issue on plain Debian stretch with kernel 4.9.110-3+deb9u4 and ZFS 0.7.9 (from stretch-backports).
The problem does not occur on plain Debian stretch with kernel 4.9.110-3+deb9u4 and ZFS 0.6.5.9.

Please let me know if/how I can help in hunting this down/fixing this.
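
For anyone comparing their own logs against the console output in this comment, here is a small sketch that lists each task flagged by the hung-task detector and whether its trace mentions a given symbol such as `txg_wait_open`. `hung_task_summary` is a hypothetical name, and the sketch assumes the log uses the "INFO: task NAME:PID blocked for more than N seconds." format and has been saved to a file or is piped in on stdin.

```sh
# hung_task_summary SYMBOL [LOGFILE]
# For every "INFO: task NAME:PID blocked for more than ..." record in
# LOGFILE (default: stdin), prints "NAME:PID yes" if SYMBOL appears in the
# lines that follow that record, otherwise "NAME:PID no".
hung_task_summary() {
    awk -v sym="$1" '
        /INFO: task .* blocked for more than/ {
            # A new record starts: flush the previous one first.
            if (task != "") print task, (hit ? "yes" : "no")
            task = $0
            sub(/.*INFO: task /, "", task)
            sub(/ blocked for more than.*/, "", task)
            hit = 0
            next
        }
        index($0, sym) { hit = 1 }
        END { if (task != "") print task, (hit ? "yes" : "no") }
    ' "${2:--}"
}
```

For example, `dmesg | hung_task_summary txg_wait_open` gives a quick view of how many blocked tasks are waiting in the same place.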

Console output with hung tasks on 4.15.18-4 (ZFS 0.7.9):

[  846.823830] INFO: task systemd:1 blocked for more than 120 seconds.
[  846.824978]       Tainted: P           O     4.15.18-4-pve #1
[  846.825888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.826774] systemd         D    0     1      0 0x00000000
[  846.827409] Call Trace:
[  846.827722]  __schedule+0x3e0/0x870
[  846.828145]  schedule+0x36/0x80
[  846.828541]  cv_wait_common+0x11e/0x140 [spl]
[  846.829088]  ? wait_woken+0x80/0x80
[  846.829477]  __cv_wait+0x15/0x20 [spl]
[  846.829968]  txg_wait_open+0xb0/0x100 [zfs]
[  846.830455]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.830992]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.831485]  zvol_write+0x175/0x620 [zfs]
[  846.831994]  zvol_request+0x24a/0x300 [zfs]
[  846.832513]  ? SyS_madvise+0xa20/0xa20
[  846.833124]  generic_make_request+0x123/0x2f0
[  846.833908]  submit_bio+0x73/0x140
[  846.834491]  ? submit_bio+0x73/0x140
[  846.835098]  ? get_swap_bio+0xcf/0x100
[  846.835780]  __swap_writepage+0x345/0x3b0
[  846.836463]  ? __frontswap_store+0x73/0x100
[  846.837190]  swap_writepage+0x34/0x90
[  846.838503]  pageout.isra.53+0x1e5/0x330
[  846.839874]  shrink_page_list+0x955/0xb70
[  846.841326]  shrink_inactive_list+0x256/0x5e0
[  846.842768]  ? next_arg+0x80/0x110
[  846.843851]  shrink_node_memcg+0x365/0x780
[  846.845052]  shrink_node+0xe1/0x310
[  846.846113]  ? shrink_node+0xe1/0x310
[  846.847153]  do_try_to_free_pages+0xef/0x360
[  846.848258]  try_to_free_pages+0xf2/0x1b0
[  846.849392]  __alloc_pages_slowpath+0x401/0xf10
[  846.850513]  ? __page_cache_alloc+0x86/0x90
[  846.851612]  __alloc_pages_nodemask+0x25b/0x280
[  846.852784]  alloc_pages_current+0x6a/0xe0
[  846.853903]  __page_cache_alloc+0x86/0x90
[  846.855011]  filemap_fault+0x369/0x740
[  846.856057]  ? page_add_file_rmap+0xf7/0x150
[  846.857482]  ? filemap_map_pages+0x369/0x380
[  846.858618]  ext4_filemap_fault+0x31/0x44
[  846.859708]  __do_fault+0x24/0xe3
[  846.860751]  __handle_mm_fault+0xcd7/0x11e0
[  846.861842]  ? ep_read_events_proc+0xd0/0xd0
[  846.863129]  handle_mm_fault+0xce/0x1b0
[  846.864175]  __do_page_fault+0x25e/0x500
[  846.865251]  ? wake_up_q+0x80/0x80
[  846.866175]  do_page_fault+0x2e/0xe0
[  846.867108]  ? async_page_fault+0x2f/0x50
[  846.868080]  do_async_page_fault+0x1a/0x80
[  846.869078]  async_page_fault+0x45/0x50
[  846.869985] RIP: 0033:0x7f551d2d30a3
[  846.870874] RSP: 002b:00007ffdbd747e28 EFLAGS: 00010246
[  846.872075] RAX: 0000000000000001 RBX: 0000558ebcbf3120 RCX: 00007f551d2d30a3
[  846.873473] RDX: 0000000000000041 RSI: 00007ffdbd747e30 RDI: 0000000000000004
[  846.874811] RBP: 00007ffdbd748240 R08: 431bde82d7b634db R09: 0000000000000100
[  846.876098] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ffdbd747e30
[  846.877594] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000003
[  846.878882] INFO: task kswapd0:34 blocked for more than 120 seconds.
[  846.880099]       Tainted: P           O     4.15.18-4-pve #1
[  846.881379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.882873] kswapd0         D    0    34      2 0x80000000
[  846.883989] Call Trace:
[  846.884834]  __schedule+0x3e0/0x870
[  846.885771]  schedule+0x36/0x80
[  846.886722]  cv_wait_common+0x11e/0x140 [spl]
[  846.887853]  ? wait_woken+0x80/0x80
[  846.888889]  __cv_wait+0x15/0x20 [spl]
[  846.889927]  txg_wait_open+0xb0/0x100 [zfs]
[  846.891027]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.892132]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.893185]  zvol_write+0x175/0x620 [zfs]
[  846.894136]  ? avl_add+0x74/0xa0 [zavl]
[  846.895112]  zvol_request+0x24a/0x300 [zfs]
[  846.896124]  ? SyS_madvise+0xa20/0xa20
[  846.897118]  generic_make_request+0x123/0x2f0
[  846.898189]  submit_bio+0x73/0x140
[  846.899171]  ? submit_bio+0x73/0x140
[  846.900115]  ? get_swap_bio+0xcf/0x100
[  846.901183]  __swap_writepage+0x345/0x3b0
[  846.902212]  ? __frontswap_store+0x73/0x100
[  846.903221]  swap_writepage+0x34/0x90
[  846.904185]  pageout.isra.53+0x1e5/0x330
[  846.905207]  shrink_page_list+0x955/0xb70
[  846.906238]  shrink_inactive_list+0x256/0x5e0
[  846.907257]  ? __wake_up+0x13/0x20
[  846.908211]  ? __cv_signal+0x2d/0x40 [spl]
[  846.909286]  ? next_arg+0x80/0x110
[  846.910237]  shrink_node_memcg+0x365/0x780
[  846.911275]  shrink_node+0xe1/0x310
[  846.912253]  ? shrink_node+0xe1/0x310
[  846.913273]  kswapd+0x386/0x770
[  846.914213]  kthread+0x105/0x140
[  846.915168]  ? mem_cgroup_shrink_node+0x180/0x180
[  846.916326]  ? kthread_create_worker_on_cpu+0x70/0x70
[  846.917519]  ret_from_fork+0x35/0x40
[  846.918631] INFO: task jbd2/sda1-8:196 blocked for more than 120 seconds.
[  846.920121]       Tainted: P           O     4.15.18-4-pve #1
[  846.921514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.923138] jbd2/sda1-8     D    0   196      2 0x80000000
[  846.924465] Call Trace:
[  846.925372]  __schedule+0x3e0/0x870
[  846.926425]  schedule+0x36/0x80
[  846.927439]  cv_wait_common+0x11e/0x140 [spl]
[  846.928567]  ? wait_woken+0x80/0x80
[  846.929635]  __cv_wait+0x15/0x20 [spl]
[  846.930714]  txg_wait_open+0xb0/0x100 [zfs]
[  846.931826]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.932937]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.934063]  zvol_write+0x175/0x620 [zfs]
[  846.935090]  ? avl_find+0x5f/0xa0 [zavl]
[  846.936147]  zvol_request+0x24a/0x300 [zfs]
[  846.937292]  ? SyS_madvise+0xa20/0xa20
[  846.938274]  generic_make_request+0x123/0x2f0
[  846.939445]  submit_bio+0x73/0x140
[  846.940395]  ? submit_bio+0x73/0x140
[  846.941384]  ? get_swap_bio+0xcf/0x100
[  846.942427]  __swap_writepage+0x345/0x3b0
[  846.943393]  ? __frontswap_store+0x73/0x100
[  846.944395]  swap_writepage+0x34/0x90
[  846.945419]  pageout.isra.53+0x1e5/0x330
[  846.946424]  shrink_page_list+0x955/0xb70
[  846.947423]  shrink_inactive_list+0x256/0x5e0
[  846.948465]  ? next_arg+0x80/0x110
[  846.949443]  shrink_node_memcg+0x365/0x780
[  846.950518]  ? _cond_resched+0x1a/0x50
[  846.951533]  shrink_node+0xe1/0x310
[  846.952469]  ? shrink_node+0xe1/0x310
[  846.953446]  do_try_to_free_pages+0xef/0x360
[  846.954449]  try_to_free_pages+0xf2/0x1b0
[  846.955670]  __alloc_pages_slowpath+0x401/0xf10
[  846.956738]  ? ext4_map_blocks+0x436/0x5d0
[  846.957745]  ? __switch_to_asm+0x34/0x70
[  846.958753]  __alloc_pages_nodemask+0x25b/0x280
[  846.959807]  alloc_pages_current+0x6a/0xe0
[  846.960860]  __page_cache_alloc+0x86/0x90
[  846.961867]  pagecache_get_page+0xab/0x2b0
[  846.962968]  __getblk_gfp+0x109/0x300
[  846.964009]  jbd2_journal_get_descriptor_buffer+0x5e/0xe0
[  846.965220]  journal_submit_commit_record+0x84/0x200
[  846.966361]  ? wait_woken+0x80/0x80
[  846.967350]  jbd2_journal_commit_transaction+0x1299/0x1720
[  846.968574]  ? __switch_to_asm+0x34/0x70
[  846.969665]  ? __switch_to_asm+0x40/0x70
[  846.970672]  ? finish_task_switch+0x74/0x200
[  846.971787]  kjournald2+0xc8/0x260
[  846.972772]  ? kjournald2+0xc8/0x260
[  846.973770]  ? wait_woken+0x80/0x80
[  846.974696]  kthread+0x105/0x140
[  846.975568]  ? commit_timeout+0x20/0x20
[  846.976555]  ? kthread_create_worker_on_cpu+0x70/0x70
[  846.977751]  ? do_syscall_64+0x73/0x130
[  846.978746]  ? SyS_exit_group+0x14/0x20
[  846.979736]  ret_from_fork+0x35/0x40
[  846.980747] INFO: task systemd-journal:236 blocked for more than 120 seconds.
[  846.982049]       Tainted: P           O     4.15.18-4-pve #1
[  846.983182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.984707] systemd-journal D    0   236      1 0x00000120
[  846.986238] Call Trace:
[  846.987172]  __schedule+0x3e0/0x870
[  846.988109]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[  846.989188]  schedule+0x36/0x80
[  846.990073]  cv_wait_common+0x11e/0x140 [spl]
[  846.991083]  ? wait_woken+0x80/0x80
[  846.992044]  __cv_wait+0x15/0x20 [spl]
[  846.993046]  txg_wait_open+0xb0/0x100 [zfs]
[  846.994018]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.994985]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.996196]  zvol_write+0x175/0x620 [zfs]
[  846.997469]  zvol_request+0x24a/0x300 [zfs]
[  846.998562]  ? SyS_madvise+0xa20/0xa20
[  846.999542]  generic_make_request+0x123/0x2f0
[  847.000637]  submit_bio+0x73/0x140
[  847.001711]  ? submit_bio+0x73/0x140
[  847.002674]  ? get_swap_bio+0xcf/0x100
[  847.003655]  __swap_writepage+0x345/0x3b0
[  847.004788]  ? __frontswap_store+0x73/0x100
[  847.005816]  swap_writepage+0x34/0x90
[  847.006733]  pageout.isra.53+0x1e5/0x330
[  847.007692]  shrink_page_list+0x955/0xb70
[  847.008845]  shrink_inactive_list+0x256/0x5e0
[  847.010100]  ? next_arg+0x80/0x110
[  847.011214]  shrink_node_memcg+0x365/0x780
[  847.012477]  shrink_node+0xe1/0x310
[  847.013530]  ? shrink_node+0xe1/0x310
[  847.014567]  do_try_to_free_pages+0xef/0x360
[  847.015686]  try_to_free_pages+0xf2/0x1b0
[  847.016894]  __alloc_pages_slowpath+0x401/0xf10
[  847.017958]  ? __page_cache_alloc+0x86/0x90
[  847.018954]  __alloc_pages_nodemask+0x25b/0x280
[  847.020080]  alloc_pages_current+0x6a/0xe0
[  847.021240]  __page_cache_alloc+0x86/0x90
[  847.022334]  filemap_fault+0x369/0x740
[  847.023299]  ? page_add_file_rmap+0xf7/0x150
[  847.024324]  ? filemap_map_pages+0x369/0x380
[  847.025381]  ext4_filemap_fault+0x31/0x44
[  847.026388]  __do_fault+0x24/0xe3
[  847.027337]  __handle_mm_fault+0xcd7/0x11e0
[  847.028382]  ? ep_read_events_proc+0xd0/0xd0
[  847.029472]  handle_mm_fault+0xce/0x1b0
[  847.030504]  __do_page_fault+0x25e/0x500
[  847.031518]  ? wake_up_q+0x80/0x80
[  847.032505]  do_page_fault+0x2e/0xe0
[  847.033544]  ? async_page_fault+0x2f/0x50
[  847.034608]  do_async_page_fault+0x1a/0x80
[  847.035740]  async_page_fault+0x45/0x50
[  847.036859] RIP: 0033:0x7f5e14a430c3
[  847.037888] RSP: 002b:00007ffde5fb7fd0 EFLAGS: 00010293
[  847.039163] RAX: 0000000000000001 RBX: 000056164a08f1e0 RCX: 00007f5e14a430c3
[  847.040736] RDX: 000000000000001c RSI: 00007ffde5fb7fe0 RDI: 0000000000000008
[  847.042378] RBP: 00007ffde5fb8230 R08: 431bde82d7b634db R09: 000000f966b2e318
[  847.043991] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007ffde5fb7fe0
[  847.045640] R13: 0000000000000001 R14: ffffffffffffffff R15: 0005750c02d60440
[  847.047266] INFO: task systemd-udevd:255 blocked for more than 120 seconds.
[  847.048893]       Tainted: P           O     4.15.18-4-pve #1
[  847.050532] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.052367] systemd-udevd   D    0   255      1 0x00000120
[  847.053924] Call Trace:
[  847.054897]  __schedule+0x3e0/0x870
[  847.056031]  schedule+0x36/0x80
[  847.057129]  cv_wait_common+0x11e/0x140 [spl]
[  847.058391]  ? wait_woken+0x80/0x80
[  847.059493]  __cv_wait+0x15/0x20 [spl]
[  847.060683]  txg_wait_open+0xb0/0x100 [zfs]
[  847.061979]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.063203]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.064447]  ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[  847.065775]  dmu_sync_late_arrival+0x53/0x150 [zfs]
[  847.067099]  ? dmu_write_policy+0xca/0x350 [zfs]
[  847.068394]  dmu_sync+0x3b4/0x490 [zfs]
[  847.069756]  ? dbuf_hold+0x33/0x60 [zfs]
[  847.070998]  ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[  847.072266]  zvol_get_data+0x164/0x180 [zfs]
[  847.073532]  zil_commit.part.14+0x451/0x8b0 [zfs]
[  847.074855]  zil_commit+0x17/0x20 [zfs]
[  847.076040]  zvol_write+0x5a2/0x620 [zfs]
[  847.077285]  zvol_request+0x24a/0x300 [zfs]
[  847.078528]  ? SyS_madvise+0xa20/0xa20
[  847.079677]  generic_make_request+0x123/0x2f0
[  847.080963]  submit_bio+0x73/0x140
[  847.082058]  ? submit_bio+0x73/0x140
[  847.083093]  ? get_swap_bio+0xcf/0x100
[  847.084135]  __swap_writepage+0x345/0x3b0
[  847.085284]  ? __frontswap_store+0x73/0x100
[  847.086424]  swap_writepage+0x34/0x90
[  847.087471]  pageout.isra.53+0x1e5/0x330
[  847.088565]  shrink_page_list+0x955/0xb70
[  847.089688]  shrink_inactive_list+0x256/0x5e0
[  847.090715]  ? next_arg+0x80/0x110
[  847.091670]  shrink_node_memcg+0x365/0x780
[  847.092808]  shrink_node+0xe1/0x310
[  847.093789]  ? shrink_node+0xe1/0x310
[  847.094739]  do_try_to_free_pages+0xef/0x360
[  847.095766]  try_to_free_pages+0xf2/0x1b0
[  847.096834]  __alloc_pages_slowpath+0x401/0xf10
[  847.097863]  ? security_file_open+0x90/0xa0
[  847.098852]  ? terminate_walk+0x91/0xf0
[  847.099919]  __alloc_pages_nodemask+0x25b/0x280
[  847.101087]  alloc_pages_current+0x6a/0xe0
[  847.102101]  __get_free_pages+0xe/0x30
[  847.103044]  pgd_alloc+0x1e/0x170
[  847.103941]  mm_init+0x197/0x280
[  847.104935]  copy_process.part.35+0xa50/0x1ab0
[  847.106029]  ? __seccomp_filter+0x49/0x540
[  847.107024]  ? _raw_spin_unlock_bh+0x1e/0x20
[  847.108080]  _do_fork+0xdf/0x3f0
[  847.109080]  ? __secure_computing+0x3f/0x100
[  847.110242]  ? syscall_trace_enter+0xca/0x2e0
[  847.111348]  SyS_clone+0x19/0x20
[  847.112435]  do_syscall_64+0x73/0x130
[  847.113606]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.114968] RIP: 0033:0x7f7ae276a38b
[  847.116137] RSP: 002b:00007ffea1b4b950 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  847.117728] RAX: ffffffffffffffda RBX: 00007ffea1b4b950 RCX: 00007f7ae276a38b
[  847.119429] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  847.121162] RBP: 00007ffea1b4b9a0 R08: 00007f7ae39148c0 R09: 0000000000000210
[  847.122879] R10: 00007f7ae3914b90 R11: 0000000000000246 R12: 0000000000000000
[  847.124640] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[  847.126237] INFO: task systemd-timesyn:2623 blocked for more than 120 seconds.
[  847.127856]       Tainted: P           O     4.15.18-4-pve #1
[  847.129293] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.130993] systemd-timesyn D    0  2623      1 0x00000120
[  847.132390] Call Trace:
[  847.133404]  __schedule+0x3e0/0x870
[  847.134513]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[  847.135783]  schedule+0x36/0x80
[  847.136876]  cv_wait_common+0x11e/0x140 [spl]
[  847.138128]  ? wait_woken+0x80/0x80
[  847.139215]  __cv_wait+0x15/0x20 [spl]
[  847.140503]  txg_wait_open+0xb0/0x100 [zfs]
[  847.141783]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.142980]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.144255]  zvol_write+0x175/0x620 [zfs]
[  847.145531]  ? avl_find+0x5f/0xa0 [zavl]
[  847.146818]  zvol_request+0x24a/0x300 [zfs]
[  847.166365]  ? SyS_madvise+0xa20/0xa20
[  847.167706]  generic_make_request+0x123/0x2f0
[  847.169118]  submit_bio+0x73/0x140
[  847.170263]  ? submit_bio+0x73/0x140
[  847.171385]  ? get_swap_bio+0xcf/0x100
[  847.172468]  __swap_writepage+0x345/0x3b0
[  847.173624]  ? __frontswap_store+0x73/0x100
[  847.174690]  swap_writepage+0x34/0x90
[  847.175697]  pageout.isra.53+0x1e5/0x330
[  847.176840]  shrink_page_list+0x955/0xb70
[  847.177948]  shrink_inactive_list+0x256/0x5e0
[  847.179101]  ? next_arg+0x80/0x110
[  847.180086]  shrink_node_memcg+0x365/0x780
[  847.181168]  shrink_node+0xe1/0x310
[  847.182126]  ? shrink_node+0xe1/0x310
[  847.183114]  do_try_to_free_pages+0xef/0x360
[  847.184216]  try_to_free_pages+0xf2/0x1b0
[  847.185362]  __alloc_pages_slowpath+0x401/0xf10
[  847.186524]  ? dev_queue_xmit+0x10/0x20
[  847.187576]  ? br_dev_queue_push_xmit+0x7a/0x140
[  847.188691]  __alloc_pages_nodemask+0x25b/0x280
[  847.189891]  alloc_pages_vma+0x88/0x1c0
[  847.190975]  __read_swap_cache_async+0x147/0x200
[  847.192166]  read_swap_cache_async+0x2b/0x60
[  847.193337]  swapin_readahead+0x22f/0x2b0
[  847.194369]  ? radix_tree_lookup_slot+0x22/0x50
[  847.195408]  ? find_get_entry+0x1e/0x100
[  847.196396]  ? pagecache_get_page+0x2c/0x2b0
[  847.197482]  do_swap_page+0x52d/0x9b0
[  847.198517]  ? do_swap_page+0x52d/0x9b0
[  847.199527]  ? crypto_shash_update+0x47/0x130
[  847.200539]  __handle_mm_fault+0x88d/0x11e0
[  847.201564]  ? jbd2_journal_dirty_metadata+0x22d/0x290
[  847.202735]  handle_mm_fault+0xce/0x1b0
[  847.203761]  __do_page_fault+0x25e/0x500
[  847.204766]  ? __switch_to_asm+0x34/0x70
[  847.205728]  do_page_fault+0x2e/0xe0
[  847.206648]  do_async_page_fault+0x1a/0x80
[  847.207715]  async_page_fault+0x25/0x50
[  847.208725] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[  847.209764] RSP: 0000:ffffad16810c3d80 EFLAGS: 00010246
[  847.210821] RAX: 0000000000000001 RBX: ffffad16810c3df8 RCX: 00007fffc09eaec0
[  847.212099] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[  847.213504] RBP: ffffad16810c3dd8 R08: 00000000000002c6 R09: ffff9d25fb9f5f98
[  847.214822] R10: ffffad16810c3cc0 R11: 0000000000000038 R12: 0000000000000000
[  847.216193] R13: ffffad16810c3e78 R14: ffff9d25fb9b89c0 R15: ffff9d25fb9f5f98
[  847.217572]  ? ep_read_events_proc+0xd0/0xd0
[  847.218601]  ep_scan_ready_list.constprop.18+0x9e/0x210
[  847.219743]  ep_poll+0x1f8/0x3b0
[  847.220735]  ? wake_up_q+0x80/0x80
[  847.221659]  SyS_epoll_wait+0xce/0xf0
[  847.222683]  do_syscall_64+0x73/0x130
[  847.223681]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.224859] RIP: 0033:0x7fe0d40920c3
[  847.225852] RSP: 002b:00007fffc09eaeb0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[  847.227252] RAX: ffffffffffffffda RBX: 000056492a0138f0 RCX: 00007fe0d40920c3
[  847.228575] RDX: 0000000000000006 RSI: 00007fffc09eaec0 RDI: 0000000000000004
[  847.229952] RBP: 00007fffc09eb010 R08: 431bde82d7b634db R09: 00000000ffffffff
[  847.231397] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007fffc09eaec0
[  847.232823] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000000
[  847.234199] INFO: task rpcbind:2624 blocked for more than 120 seconds.
[  847.235669]       Tainted: P           O     4.15.18-4-pve #1
[  847.237042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.242089] rpcbind         D    0  2624      1 0x00000000
[  847.244049] Call Trace:
[  847.245324]  __schedule+0x3e0/0x870
[  847.246883]  schedule+0x36/0x80
[  847.247957]  cv_wait_common+0x11e/0x140 [spl]
[  847.249063]  ? wait_woken+0x80/0x80
[  847.250169]  __cv_wait+0x15/0x20 [spl]
[  847.251423]  txg_wait_open+0xb0/0x100 [zfs]
[  847.252706]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.254264]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.255518]  zvol_write+0x175/0x620 [zfs]
[  847.256706]  zvol_request+0x24a/0x300 [zfs]
[  847.258092]  ? SyS_madvise+0xa20/0xa20
[  847.259682]  generic_make_request+0x123/0x2f0
[  847.261515]  submit_bio+0x73/0x140
[  847.263119]  ? submit_bio+0x73/0x140
[  847.264676]  ? get_swap_bio+0xcf/0x100
[  847.266047]  __swap_writepage+0x345/0x3b0
[  847.267622]  ? __frontswap_store+0x73/0x100
[  847.269369]  swap_writepage+0x34/0x90
[  847.270991]  pageout.isra.53+0x1e5/0x330
[  847.272710]  shrink_page_list+0x955/0xb70
[  847.274195]  shrink_inactive_list+0x256/0x5e0
[  847.275618]  ? next_arg+0x80/0x110
[  847.276909]  shrink_node_memcg+0x365/0x780
[  847.278354]  shrink_node+0xe1/0x310
[  847.279636]  ? shrink_node+0xe1/0x310
[  847.280971]  do_try_to_free_pages+0xef/0x360
[  847.282448]  try_to_free_pages+0xf2/0x1b0
[  847.283825]  __alloc_pages_slowpath+0x401/0xf10
[  847.285329]  ? __hrtimer_init+0xa0/0xa0
[  847.286574]  ? poll_freewait+0x4a/0xb0
[  847.287577]  __alloc_pages_nodemask+0x25b/0x280
[  847.288654]  alloc_pages_current+0x6a/0xe0
[  847.289718]  __page_cache_alloc+0x86/0x90
[  847.290707]  __do_page_cache_readahead+0x10e/0x2d0
[  847.291834]  ? radix_tree_lookup_slot+0x22/0x50
[  847.293072]  ? __intel_pmu_enable_all.constprop.19+0x4d/0x80
[  847.294357]  ? find_get_entry+0x1e/0x100
[  847.295449]  filemap_fault+0x571/0x740
[  847.296492]  ? filemap_fault+0x571/0x740
[  847.297612]  ? filemap_map_pages+0x180/0x380
[  847.298682]  ext4_filemap_fault+0x31/0x44
[  847.299784]  __do_fault+0x24/0xe3
[  847.300829]  __handle_mm_fault+0xcd7/0x11e0
[  847.301910]  handle_mm_fault+0xce/0x1b0
[  847.302926]  __do_page_fault+0x25e/0x500
[  847.303946]  ? ktime_get_ts64+0x51/0xf0
[  847.305059]  do_page_fault+0x2e/0xe0
[  847.306133]  ? async_page_fault+0x2f/0x50
[  847.307158]  do_async_page_fault+0x1a/0x80
[  847.308292]  async_page_fault+0x45/0x50
[  847.309415] RIP: 0033:0x7fc96419d660
[  847.310441] RSP: 002b:00007fff1212c788 EFLAGS: 00010246
[  847.311917] RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007fc96419d660
[  847.313598] RDX: 0000000000007530 RSI: 0000000000000007 RDI: 00007fff1212c9d0
[  847.315185] RBP: 0000000000000007 R08: 00000000000000c3 R09: 000000000000000b
[  847.316666] R10: 00007fff1212c700 R11: 0000000000000246 R12: 000000000000000c
[  847.318208] R13: 000000000000000c R14: 0000558af8c4d480 R15: 00007fff1212c810
[  847.319523] INFO: task systemd-logind:2759 blocked for more than 120 seconds.
[  847.320812]       Tainted: P           O     4.15.18-4-pve #1
[  847.322042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.323441] systemd-logind  D    0  2759      1 0x00000120
[  847.324575] Call Trace:
[  847.325493]  __schedule+0x3e0/0x870
[  847.326482]  schedule+0x36/0x80
[  847.327482]  cv_wait_common+0x11e/0x140 [spl]
[  847.328683]  ? wait_woken+0x80/0x80
[  847.329778]  __cv_wait+0x15/0x20 [spl]
[  847.330970]  txg_wait_open+0xb0/0x100 [zfs]
[  847.332121]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.333402]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.334552]  zvol_write+0x175/0x620 [zfs]
[  847.335630]  zvol_request+0x24a/0x300 [zfs]
[  847.336841]  ? SyS_madvise+0xa20/0xa20
[  847.337971]  generic_make_request+0x123/0x2f0
[  847.339185]  submit_bio+0x73/0x140
[  847.340308]  ? submit_bio+0x73/0x140
[  847.341486]  ? get_swap_bio+0xcf/0x100
[  847.342568]  __swap_writepage+0x345/0x3b0
[  847.343733]  ? __frontswap_store+0x73/0x100
[  847.344923]  swap_writepage+0x34/0x90
[  847.345982]  pageout.isra.53+0x1e5/0x330
[  847.347115]  shrink_page_list+0x955/0xb70
[  847.348290]  shrink_inactive_list+0x256/0x5e0
[  847.349570]  ? virtqueue_add_sgs+0x3b0/0x490
[  847.350704]  ? next_arg+0x80/0x110
[  847.351733]  shrink_node_memcg+0x365/0x780
[  847.352902]  shrink_node+0xe1/0x310
[  847.353928]  ? shrink_node+0xe1/0x310
[  847.355200]  do_try_to_free_pages+0xef/0x360
[  847.356538]  try_to_free_pages+0xf2/0x1b0
[  847.357638]  __alloc_pages_slowpath+0x401/0xf10
[  847.358760]  ? __update_load_avg_se.isra.38+0x1bc/0x1d0
[  847.360134]  ? __switch_to_asm+0x40/0x70
[  847.361353]  ? __switch_to_asm+0x34/0x70
[  847.362495]  ? __switch_to_asm+0x34/0x70
[  847.363611]  ? __switch_to_asm+0x40/0x70
[  847.364668]  __alloc_pages_nodemask+0x25b/0x280
[  847.365826]  alloc_pages_vma+0x88/0x1c0
[  847.366881]  __read_swap_cache_async+0x147/0x200
[  847.368331]  read_swap_cache_async+0x2b/0x60
[  847.369577]  swapin_readahead+0x22f/0x2b0
[  847.370702]  ? radix_tree_lookup_slot+0x22/0x50
[  847.371913]  ? find_get_entry+0x1e/0x100
[  847.373118]  ? pagecache_get_page+0x2c/0x2b0
[  847.374463]  do_swap_page+0x52d/0x9b0
[  847.375610]  ? do_swap_page+0x52d/0x9b0
[  847.376802]  ? __wake_up_sync_key+0x1e/0x30
[  847.378066]  __handle_mm_fault+0x88d/0x11e0
[  847.379291]  handle_mm_fault+0xce/0x1b0
[  847.380417]  __do_page_fault+0x25e/0x500
[  847.381577]  ? __switch_to_asm+0x34/0x70
[  847.382513]  do_page_fault+0x2e/0xe0
[  847.383370]  do_async_page_fault+0x1a/0x80
[  847.384302]  async_page_fault+0x25/0x50
[  847.385225] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[  847.386258] RSP: 0000:ffffad168108bd80 EFLAGS: 00010246
[  847.387335] RAX: 0000000000000001 RBX: ffffad168108bdf8 RCX: 00007fffea7798b0
[  847.388657] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[  847.390106] RBP: ffffad168108bdd8 R08: 000000000000030a R09: ffff9d25f5dfc218
[  847.391442] R10: ffffad168108bcc0 R11: 0000000000000000 R12: 0000000000000000
[  847.392743] R13: ffffad168108be78 R14: ffff9d25faea20c0 R15: ffff9d25f5dfc218
[  847.394189]  ? ep_read_events_proc+0xd0/0xd0
[  847.395244]  ep_scan_ready_list.constprop.18+0x9e/0x210
[  847.396415]  ep_poll+0x1f8/0x3b0
[  847.397387]  ? wake_up_q+0x80/0x80
[  847.398331]  SyS_epoll_wait+0xce/0xf0
[  847.399313]  do_syscall_64+0x73/0x130
[  847.400232]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.401329] RIP: 0033:0x7fc20e6080a3
[  847.402292] RSP: 002b:00007fffea7798a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[  847.403817] RAX: ffffffffffffffda RBX: 0000560a68ae1260 RCX: 00007fc20e6080a3
[  847.405311] RDX: 000000000000000d RSI: 00007fffea7798b0 RDI: 0000000000000004
[  847.406765] RBP: 00007fffea779a50 R08: 0000560a68aed6e0 R09: 0000000000000000
[  847.408354] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007fffea7798b0
[  847.410017] R13: 0000000000000001 R14: ffffffffffffffff R15: 00007fffea779b50
[  847.411383] INFO: task qemu-ga:2842 blocked for more than 120 seconds.
[  847.413009]       Tainted: P           O     4.15.18-4-pve #1
[  847.414513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.416280] qemu-ga         D    0  2842      1 0x00000000
[  847.417779] Call Trace:
[  847.418769]  __schedule+0x3e0/0x870
[  847.419916]  schedule+0x36/0x80
[  847.421044]  cv_wait_common+0x11e/0x140 [spl]
[  847.422126]  ? wait_woken+0x80/0x80
[  847.423130]  __cv_wait+0x15/0x20 [spl]
[  847.424295]  zil_commit.part.14+0x86/0x8b0 [zfs]
[  847.425726]  ? spl_kmem_free+0x33/0x40 [spl]
[  847.427098]  ? zfs_range_unlock+0x1b3/0x2e0 [zfs]
[  847.428477]  zil_commit+0x17/0x20 [zfs]
[  847.429738]  zvol_write+0x5a2/0x620 [zfs]
[  847.430951]  ? avl_find+0x5f/0xa0 [zavl]
[  847.432198]  zvol_request+0x24a/0x300 [zfs]
[  847.433506]  ? SyS_madvise+0xa20/0xa20
[  847.434707]  generic_make_request+0x123/0x2f0
[  847.435984]  submit_bio+0x73/0x140
[  847.437152]  ? submit_bio+0x73/0x140
[  847.438330]  ? get_swap_bio+0xcf/0x100
[  847.439487]  __swap_writepage+0x345/0x3b0
[  847.440700]  ? __frontswap_store+0x73/0x100
[  847.441983]  swap_writepage+0x34/0x90
[  847.443175]  pageout.isra.53+0x1e5/0x330
[  847.444443]  shrink_page_list+0x955/0xb70
[  847.445809]  shrink_inactive_list+0x256/0x5e0
[  847.447055]  ? next_arg+0x80/0x110
[  847.448177]  shrink_node_memcg+0x365/0x780
[  847.449443]  shrink_node+0xe1/0x310
[  847.450563]  ? shrink_node+0xe1/0x310
[  847.451733]  do_try_to_free_pages+0xef/0x360
[  847.452995]  try_to_free_pages+0xf2/0x1b0
[  847.454159]  __alloc_pages_slowpath+0x401/0xf10
[  847.455441]  ? __page_cache_alloc+0x86/0x90
[  847.456679]  __alloc_pages_nodemask+0x25b/0x280
[  847.457968]  alloc_pages_current+0x6a/0xe0
[  847.459194]  __page_cache_alloc+0x86/0x90
[  847.460374]  filemap_fault+0x369/0x740
[  847.461542]  ? __switch_to_asm+0x40/0x70
[  847.462696]  ? __switch_to_asm+0x34/0x70
[  847.463843]  ? filemap_map_pages+0x180/0x380
[  847.465061]  ? __switch_to_asm+0x34/0x70
[  847.466211]  ext4_filemap_fault+0x31/0x44
[  847.467346]  __do_fault+0x24/0xe3
[  847.468398]  __handle_mm_fault+0xcd7/0x11e0
[  847.469631]  handle_mm_fault+0xce/0x1b0
[  847.470750]  __do_page_fault+0x25e/0x500
[  847.471897]  do_page_fault+0x2e/0xe0
[  847.472997]  ? async_page_fault+0x2f/0x50
[  847.474089]  do_async_page_fault+0x1a/0x80
[  847.475205]  async_page_fault+0x45/0x50
[  847.476285] RIP: 0033:0x55c573934927
[  847.477358] RSP: 002b:00007ffee4db6630 EFLAGS: 00010206
[  847.478611] RAX: 0000000000000000 RBX: 000055c575723e20 RCX: 00007ff976bbc270
[  847.480176] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffee4db6610
[  847.481798] RBP: 00007ffee4db6640 R08: 000055c57572a0c0 R09: 0000000000000000
[  847.483337] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c57572a9a0
[  847.484893] R13: 000055c57572a2a0 R14: 000055c57572ad60 R15: 00007ff97766b610
[  847.486492] INFO: task rrdcached:3267 blocked for more than 120 seconds.
[  847.487976]       Tainted: P           O     4.15.18-4-pve #1
[  847.489368] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.491033] rrdcached       D    0  3267      1 0x00000000
[  847.492429] Call Trace:
[  847.493452]  __schedule+0x3e0/0x870
[  847.494556]  schedule+0x36/0x80
[  847.495609]  cv_wait_common+0x11e/0x140 [spl]
[  847.496841]  ? wait_woken+0x80/0x80
[  847.497976]  __cv_wait+0x15/0x20 [spl]
[  847.499154]  txg_wait_open+0xb0/0x100 [zfs]
[  847.500355]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.501566]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.502758]  zvol_write+0x175/0x620 [zfs]
[  847.503976]  zvol_request+0x24a/0x300 [zfs]
[  847.505245]  ? SyS_madvise+0xa20/0xa20
[  847.506371]  generic_make_request+0x123/0x2f0
[  847.507549]  submit_bio+0x73/0x140
[  847.508608]  ? submit_bio+0x73/0x140
[  847.509801]  ? get_swap_bio+0xcf/0x100
[  847.510921]  __swap_writepage+0x345/0x3b0
[  847.512140]  ? __frontswap_store+0x73/0x100
[  847.513384]  swap_writepage+0x34/0x90
[  847.514525]  pageout.isra.53+0x1e5/0x330
[  847.515632]  shrink_page_list+0x955/0xb70
[  847.516772]  shrink_inactive_list+0x256/0x5e0
[  847.518015]  ? syscall_return_via_sysret+0x1a/0x7a
[  847.519363]  ? next_arg+0x80/0x110
[  847.520513]  shrink_node_memcg+0x365/0x780
[  847.521717]  shrink_node+0xe1/0x310
[  847.522846]  ? shrink_node+0xe1/0x310
[  847.523949]  do_try_to_free_pages+0xef/0x360
[  847.525182]  try_to_free_pages+0xf2/0x1b0
[  847.526436]  __alloc_pages_slowpath+0x401/0xf10
[  847.527735]  ? __page_cache_alloc+0x86/0x90
[  847.528986]  __alloc_pages_nodemask+0x25b/0x280
[  847.530283]  alloc_pages_current+0x6a/0xe0
[  847.531307]  __page_cache_alloc+0x86/0x90
[  847.532306]  filemap_fault+0x369/0x740
[  847.533308]  ? filemap_map_pages+0x27a/0x380
[  847.534323]  ext4_filemap_fault+0x31/0x44
[  847.535333]  __do_fault+0x24/0xe3
[  847.536304]  __handle_mm_fault+0xcd7/0x11e0
[  847.537400]  ? update_cfs_group+0xc4/0xe0
[  847.538423]  handle_mm_fault+0xce/0x1b0
[  847.539484]  __do_page_fault+0x25e/0x500
[  847.540584]  ? ktime_get_ts64+0x51/0xf0
[  847.541690]  do_page_fault+0x2e/0xe0
[  847.542714]  ? async_page_fault+0x2f/0x50
[  847.543785]  do_async_page_fault+0x1a/0x80
[  847.544885]  async_page_fault+0x45/0x50
[  847.545894] RIP: 0033:0x7f40f2d6667d
[  847.546870] RSP: 002b:00007fffcec77460 EFLAGS: 00010293
[  847.548075] RAX: 0000000000000000 RBX: 0000556934dc0c10 RCX: 00007f40f2d6667d
[  847.549599] RDX: 00000000000003e8 RSI: 0000000000000001 RDI: 0000556934dae6e0
[  847.551060] RBP: 0000000000000001 R08: 00007f40e9f24700 R09: 00007f40e9f24700
[  847.552554] R10: 00000000000008ba R11: 0000000000000293 R12: 0000556934c28360
[  847.554038] R13: 0000556934dae6e0 R14: 0000000000000000 R15: 0000000000000001

behlendorf · 2018-09-05T00:15:16Z

If possible could you dump the stack from the txg_sync process when this happens.

siv0 · 2018-09-05T13:32:23Z

Managed to get the following by running sleep 60; echo t > /proc/sysrq-trigger, before starting stress:

[10981.102235] txg_sync        R  running task        0   606      2 0x80000000
[10981.103613] Call Trace:
[10981.104615]  <IRQ>
[10981.105989]  sched_show_task+0xfe/0x130
[10981.107636]  show_state_filter+0x62/0xd0
[10981.108730]  sysrq_handle_showstate+0x10/0x20
[10981.109813]  __handle_sysrq+0x106/0x170
[10981.110859]  sysrq_filter+0x9b/0x390
[10981.111841]  input_to_handler+0x61/0x100
[10981.112901]  input_pass_values.part.6+0x117/0x130
[10981.114128]  input_handle_event+0x137/0x510
[10981.115171]  input_event+0x54/0x80
[10981.116220]  atkbd_interrupt+0x4b6/0x7a0
[10981.117176]  serio_interrupt+0x4c/0x90
[10981.118211]  i8042_interrupt+0x1f4/0x3b0
[10981.119180]  __handle_irq_event_percpu+0x84/0x1a0
[10981.120324]  handle_irq_event_percpu+0x32/0x80
[10981.121878]  handle_irq_event+0x3b/0x60
[10981.123547]  handle_edge_irq+0x78/0x1a0
[10981.124862]  handle_irq+0x20/0x30
[10981.125899]  do_IRQ+0x4e/0xd0
[10981.126730]  common_interrupt+0x84/0x84
[10981.127697]  </IRQ>
[10981.128422] RIP: 0010:page_vma_mapped_walk+0x49e/0x770
[10981.129488] RSP: 0000:ffffc3310106f188 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd6
[10981.130916] RAX: ffff9e7c795f6a80 RBX: ffffc3310106f1d8 RCX: ffff9e7c40000a80
[10981.132363] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 0000000000000000
[10981.133810] RBP: ffffc3310106f1b8 R08: 0000000000000067 R09: 0000000000000000
[10981.135262] R10: 0000000000000000 R11: 0000000000000091 R12: ffffe7bd40949440
[10981.136754] R13: ffff9e7c79468cc0 R14: 0000000000000988 R15: ffffc3310106f1d8
[10981.138596]  ? page_vma_mapped_walk+0x204/0x770
[10981.140413]  page_referenced_one+0x91/0x190
[10981.141780]  rmap_walk_anon+0x113/0x270
[10981.142918]  rmap_walk+0x48/0x60
[10981.143998]  page_referenced+0x10d/0x170
[10981.145189]  ? rmap_walk_anon+0x270/0x270
[10981.146331]  ? page_get_anon_vma+0x80/0x80
[10981.147523]  shrink_page_list+0x791/0xb70
[10981.148737]  shrink_inactive_list+0x256/0x5e0
[10981.150007]  shrink_node_memcg+0x365/0x780
[10981.151191]  ? _cond_resched+0x1a/0x50
[10981.152316]  shrink_node+0xe1/0x310
[10981.153384]  ? shrink_node+0xe1/0x310
[10981.154739]  do_try_to_free_pages+0xef/0x360
[10981.155776]  try_to_free_pages+0xf2/0x1b0
[10981.156747]  __alloc_pages_slowpath+0x401/0xf10
[10981.157910]  __alloc_pages_nodemask+0x25b/0x280
[10981.159112]  alloc_pages_current+0x6a/0xe0
[10981.160480]  new_slab+0x317/0x690
[10981.161439]  ? __blk_mq_get_tag+0x21/0x90
[10981.162485]  ___slab_alloc+0x3c1/0x4e0
[10981.163458]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.164520]  ? update_load_avg+0x665/0x700
[10981.165545]  ? update_load_avg+0x665/0x700
[10981.166545]  __slab_alloc+0x20/0x40
[10981.167472]  ? __enqueue_entity+0x5c/0x60
[10981.168405]  ? __slab_alloc+0x20/0x40
[10981.169297]  __kmalloc_node+0xe4/0x2b0
[10981.170362]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.171331]  spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.172380]  dbuf_dirty+0x20c/0x880 [zfs]
[10981.173440]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.174606]  dbuf_dirty+0x74e/0x880 [zfs]
[10981.175719]  dnode_setdirty+0xbe/0x100 [zfs]
[10981.176841]  dbuf_dirty+0x46c/0x880 [zfs]
[10981.177809]  dmu_buf_will_dirty+0x11c/0x130 [zfs]
[10981.178814]  dsl_dataset_sync+0x26/0x230 [zfs]
[10981.179801]  dsl_pool_sync+0x9f/0x430 [zfs]
[10981.180747]  spa_sync+0x42d/0xd50 [zfs]
[10981.181653]  txg_sync_thread+0x2d4/0x4a0 [zfs]
[10981.182655]  ? finish_task_switch+0x74/0x200
[10981.183603]  ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[10981.184578]  thread_generic_wrapper+0x74/0x90 [spl]
[10981.185537]  kthread+0x105/0x140
[10981.186386]  ? __thread_exit+0x20/0x20 [spl]
[10981.187333]  ? kthread_create_worker_on_cpu+0x70/0x70
[10981.188470]  ? do_syscall_64+0x73/0x130
[10981.189420]  ? kthread_create_worker_on_cpu+0x70/0x70
[10981.190450]  ret_from_fork+0x35/0x40

dweeezil · 2018-09-07T16:14:24Z

I decided to give this a try and, as advertised, the deadlock is quite easy to reproduce. First, in a 4.14 kernel, it's interesting to note that a bunch of the "page allocation stalls for ..." messages are generated. That particular message, however, was removed in later kernels because it was determined that merely trying to print the message could, itself, cause deadlocks.

In a 4.16 kernel, which does not have the message, the processes are in quite a few different states. Among other things, an interesting observation is that the "zvol" tasks are generally in a running (R) state. Most of the regular user processes are in various states of dormancy (D). This sshd process is typical ("?" lines elided for brevity):

[ 6212.296211] sshd            D    0  3018   2976 0x00000000
[ 6212.296211] Call Trace:
[ 6212.296211]  schedule+0x32/0x80
[ 6212.296211]  schedule_timeout+0x159/0x330
[ 6212.296211]  io_schedule_timeout+0x19/0x40
[ 6212.296211]  mempool_alloc+0x121/0x150
[ 6212.296211]  bvec_alloc+0x86/0xe0
[ 6212.296211]  bio_alloc_bioset+0x153/0x200
[ 6212.296211]  get_swap_bio+0x40/0xd0
[ 6212.296211]  __swap_writepage+0x292/0x360
[ 6212.296211]  pageout.isra.51+0x1bb/0x300
[ 6212.296211]  shrink_page_list+0x959/0xbd0
[ 6212.296211]  shrink_inactive_list+0x2a8/0x630
[ 6212.296211]  shrink_node_memcg+0x37f/0x7d0
[ 6212.296211]  shrink_node+0xcc/0x300
[ 6212.296211]  do_try_to_free_pages+0xdb/0x330
[ 6212.296211]  try_to_free_pages+0xd5/0x1a0
[ 6212.296211]  __alloc_pages_slowpath+0x37c/0xd80
[ 6212.296211]  __alloc_pages_nodemask+0x227/0x260
[ 6212.296211]  alloc_pages_vma+0x7c/0x1e0
[ 6212.296211]  __read_swap_cache_async+0x149/0x1c0
[ 6212.296211]  do_swap_page_readahead+0xc9/0x1a0
[ 6212.296211]  do_swap_page+0x5a5/0x820
[ 6212.296211]  __handle_mm_fault+0x91a/0x1040
[ 6212.296211]  handle_mm_fault+0xe4/0x1f0
[ 6212.296211]  __do_page_fault+0x242/0x4d0
[ 6212.296211]  do_page_fault+0x2c/0x110
[ 6212.296211]  page_fault+0x25/0x50

Other dormant user processes, such as agetty, all have pretty much identical stacks.
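For anyone trying to get the same overview from their own sysrq-t or hung-task output, here is a rough sketch that groups tasks by their one-letter state. It assumes the task header lines look like the ones pasted in this thread (name, state, an optional "running task", free stack, PID, PPID, flags); the helper name tasks_by_state is made up for illustration and the regex may need adjusting for other kernel versions.

```python
import re
from collections import defaultdict

# Matches task header lines as they appear in the traces in this thread, e.g.
# "[ 6212.296211] sshd            D    0  3018   2976 0x00000000"
TASK_LINE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+"              # dmesg timestamp
    r"(?P<comm>\S+)\s+"                 # task name
    r"(?P<state>[A-Z])\s+"              # one-letter scheduler state
    r"(?:running task\s+)?"             # only printed for runnable tasks
    r"\d+\s+"                           # free stack column
    r"(?P<pid>\d+)\s+(?P<ppid>\d+)\s+"  # PID and parent PID
    r"0x[0-9a-fA-F]+\s*$"               # task flags
)


def tasks_by_state(log_text):
    """Return {state: [(comm, pid), ...]} for every task header line found."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        m = TASK_LINE.match(line)
        if m:
            groups[m.group("state")].append((m.group("comm"), int(m.group("pid"))))
    return dict(groups)


# Header and stack lines copied from the traces above.
sample = """\
[ 6212.296211] sshd            D    0  3018   2976 0x00000000
[ 6212.296211] Call Trace:
[ 6212.296211]  schedule+0x32/0x80
[ 5891.836341] txg_quiesce     D    0  2457      2 0x80000000
[ 5891.836341] zvol            R  running task        0  3034      2 0x80000000
"""

result = tasks_by_state(sample)
print(result)
```

Stack frames, register dumps and "Call Trace:" lines do not match the pattern, so only the per-task header lines are counted.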

As far as ZFS is concerned, in the current deadlocked system I've got running, the only blocked process is "txg_quiesce" with a rather uninteresting stack of:

[ 5891.836341] txg_quiesce     D    0  2457      2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341]  schedule+0x32/0x80
[ 5891.836341]  cv_wait_common+0x129/0x260 [spl]
[ 5891.836341]  txg_quiesce_thread+0x2fc/0x500 [zfs]
[ 5891.836341]  thread_generic_wrapper+0x76/0xb0 [spl]
[ 5891.836341]  kthread+0xf5/0x130
[ 5891.836341]  ret_from_fork+0x35/0x40

which is in the cv_wait(&tc->tc_cv[g], &tc->tc_lock); at the bottom of txg_quiesce().

The most interesting zfs-related thing is that all the zvol threads are spinning (in "R" state). They all appear to be in their while (ui.ui_resid > 0 ... loop calling dmu_write_uio_dnode() and show this typical stack:

[ 5891.836341] zvol            R  running task        0  3034      2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341]  ? page_referenced+0xb9/0x160
[ 5891.836341]  ? shrink_page_list+0x467/0xbd0
[ 5891.836341]  ? shrink_inactive_list+0x2b5/0x630
[ 5891.836341]  ? shrink_node_memcg+0x37f/0x7d0
[ 5891.836341]  ? __list_lru_init+0x170/0x200
[ 5891.836341]  ? shrink_node+0xcc/0x300
[ 5891.836341]  ? shrink_node+0xcc/0x300
[ 5891.836341]  ? do_try_to_free_pages+0xdb/0x330
[ 5891.836341]  ? try_to_free_pages+0xd5/0x1a0
[ 5891.836341]  ? __alloc_pages_slowpath+0x37c/0xd80
[ 5891.836341]  ? get_page_from_freelist+0x145/0x1190
[ 5891.836341]  ? __alloc_pages_nodemask+0x227/0x260
[ 5891.836341]  ? abd_alloc+0x1e4/0x4d0 [zfs]
[ 5891.836341]  ? arc_hdr_alloc_abd+0x5b/0x230 [zfs]
[ 5891.836341]  ? arc_hdr_alloc+0x161/0x390 [zfs]
[ 5891.836341]  ? arc_alloc_buf+0x38/0x110 [zfs]
[ 5891.836341]  ? dmu_buf_will_fill+0x122/0x3b0 [zfs]
[ 5891.836341]  ? dmu_write_uio_dnode+0xe1/0x1a0 [zfs]
[ 5891.836341]  ? zvol_write+0x191/0x6c0 [zfs]

I'm going to try a couple of things: First, set zvol_request_sync=1 to see what difference it might make. Second, add some instrumentation to the zvol path to see whether it might actually be making some forward progress, albeit very slowly.
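A small sketch of how the first experiment could be scripted. It assumes the module parameter is exposed under /sys/module/zfs/parameters, which is where loaded kernel modules normally publish their parameters; the helper names are hypothetical, and the demo runs against a scratch directory so it can be tried without root or a loaded zfs module.

```python
import os
import tempfile

# Assumed location of ZFS module parameters on a running system.
ZFS_PARAM_DIR = "/sys/module/zfs/parameters"


def read_param(name, base=ZFS_PARAM_DIR):
    """Return the current value of a module parameter as a string."""
    with open(os.path.join(base, name)) as f:
        return f.read().strip()


def write_param(name, value, base=ZFS_PARAM_DIR):
    """Set a module parameter (needs root when pointed at the real sysfs)."""
    with open(os.path.join(base, name), "w") as f:
        f.write(f"{value}\n")


# Demo against a scratch directory standing in for sysfs.
with tempfile.TemporaryDirectory() as scratch:
    write_param("zvol_request_sync", 0, base=scratch)
    before = read_param("zvol_request_sync", base=scratch)
    write_param("zvol_request_sync", 1, base=scratch)
    after = read_param("zvol_request_sync", base=scratch)

print("zvol_request_sync:", before, "->", after)
```

On a real system the base argument would be left at its default, and the value should be read back afterwards to confirm the change took effect.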

cwedgwood · 2018-09-07T16:45:36Z

naively i don't see how swap can be expected to work as things are, i think we should document/label this as such

@behlendorf given the amount of code complexity and the allocations potentially required, is supporting swap even realistic?

behlendorf · 2018-09-07T18:49:41Z

@cwedgwood It's definitely been a challenging area to get working and is not heavily stress tested on a wide range of kernels. It was working reliably with older zfs releases and older kernels, but as you say given the code complexity and surface area we'd need to test it's not something I'd recommend. I tend to agree we should make this clearer in the documentation.

drescherjm · 2018-09-07T22:31:47Z

I am using this (swap on zfs) on several production systems (home and work) with recent kernels. Thankfully it has not been an issue for me yet.

inpos · 2018-09-17T13:42:19Z

Same issue on Gentoo with ZFS 0.8.0-rc1 and native encryption.

ryao · 2018-09-17T13:51:14Z

@inpos Out of curiosity, how did you get 0.8.0-rc1 on Gentoo? I have not finished my review of it, so I haven't pushed it to the main tree yet. I might just push it unkeyworded if people would otherwise resort to building outside the package manager.

ryao · 2018-09-17T13:59:43Z

@behlendorf This was never 100% sane, although the return of the zvol threads certainly did not help. I started (but did not finish) writing a patch that could help with this if finished:

https://paste.pound-python.org/show/KWlvrHdBU2mA9ev2odXL/

At this point, it is fairly clear to me that my offline obligations prevent me from speculating on if/when I will finish that, but other developers should be able to see the idea there. It ought to help if/when finished. The current version will NOT compile, so I would prefer it if users did not attempt it. Trying to compile it would be a waste of their time.

Alternatively, there is a nuclear option for resolving this. If we treat swap like Illumos treats dump devices by disabling CoW and checksums, things will definitely work. That would be at the expense of protection against bitrot on the swap. This requires modifications to the code because the current codebase is not able to support a zvol in that configuration. Quite honestly, if we were going for the nuclear option, I'd prefer to create a new "dataset" type and implement extent based allocation.

@ahrens Not that I am seriously considering this, but it would be nice to hear your thoughts on the nuclear option for dealing with swap.

inpos · 2018-09-17T14:01:00Z

@ryao Here is the content of /usr/local/portage/sys-fs/zfs/zfs-0.8.0_rc1.ebuild

And here is the content of /usr/local/portage/sys-fs/zfs-kmod/zfs-kmod-0.8.0_rc1.ebuild

ryao · 2018-09-17T14:13:43Z

On second thought, given that swap devices are simultaneously dump devices on Linux, it might make sense to implement support for creating/writing to Illumos dump devices. That would probably be a good middle ground here.

prakashsurya · 2018-09-17T16:29:00Z

I'm not an expert in this area of the code, but I think that swap on ZVOL is inherently unreliable due to writes to the swap ZVOL having to go through the normal TXG sync and ZIO write paths, which can require lots of memory allocations by design (and these memory allocations can stall due to a low memory situation). I believe this to be true for swap on ZVOL for illumos, as well as Linux, and presumably FreeBSD too (although I have no experience using it on FreeBSD, so I could be wrong).

I think the "proper" way to address this is to mimic the write path for a ZVOL dump device on illumos, for the ZVOL swap device. i.e. preallocate the ZVOL's blocks on disk, then do a "direct write" to the preallocated LBA such that it doesn't go through the normal TXG sync and ZIO write code paths. This would significantly reduce the amount of work (e.g. memory allocations) that's required to write a swap page to disk, thereby increasing the reliability of that write.

The drawback to this approach would be that we won't get the data consistency guarantees that we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).
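To make the contrast concrete, here is a toy model of the two write paths. It is only an illustration of the idea described above, not ZFS code: the class names, the dictionary-based "disk" and the metadata_updates counter are all invented for this sketch, and the real TXG sync and ZIO pipelines do far more work than a single map update.

```python
BLOCK = 4096  # assumed block size for the toy model


class CowVolume:
    """Copy-on-write: every write allocates a new block and changes the map."""

    def __init__(self):
        self.disk = {}             # physical block number -> data
        self.block_map = {}        # logical block number -> physical block number
        self.next_free = 0
        self.metadata_updates = 0  # stands in for work done at sync time

    def write(self, lbn, data):
        pbn = self.next_free       # allocation happens at write time
        self.next_free += 1
        self.disk[pbn] = data
        self.block_map[lbn] = pbn  # the map changed, so metadata must be synced
        self.metadata_updates += 1


class PreallocatedVolume:
    """Direct write: the map is fixed up front and writes overwrite in place."""

    def __init__(self, nblocks):
        self.disk = bytearray(nblocks * BLOCK)
        self.block_map = {lbn: lbn for lbn in range(nblocks)}  # built once
        self.metadata_updates = 0

    def write(self, lbn, data):
        if len(data) != BLOCK:
            raise ValueError("toy model only handles whole blocks")
        off = self.block_map[lbn] * BLOCK
        self.disk[off:off + BLOCK] = data  # no allocation, no map change


page = b"x" * BLOCK
cow = CowVolume()
pre = PreallocatedVolume(nblocks=8)
for _ in range(3):
    cow.write(5, page)
    pre.write(5, page)

print("CoW metadata updates:", cow.metadata_updates)
print("preallocated metadata updates:", pre.metadata_updates)
```

Rewriting the same logical block three times moves it three times in the copy-on-write model, while the preallocated model never touches its map, which is the property that would let a swap write avoid allocations under memory pressure.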

I spent a couple hours reading the ZVOL dump device write codepaths during the OpenZFS dev summit last week, and I think this approach is viable. It'll require porting the code over to work on Linux, where we'll need to rework the code to work with the Linux block device layer, since (IIRC) that's what's used to issue the writes to disk (see zvol_dumpio_vdev() in illumos). Additionally, we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos); for illumos dump devices, this occurs via the DKIOCDUMPINIT ioctl (when zvol_dumpify() is called).

With all that said, I've only spent an hour or two looking into this and I don't have any prior experience with swap, so I may be overlooking something or be flat-out wrong in my analysis so far. Also, I won't have any time in the near future to look into this in detail, but I could potentially help answer questions if somebody else wants to try to prototype the changes I'm proposing.

behlendorf · 2018-09-17T21:01:00Z

we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos);

I took a look at mkswap.c and aside from writing a unique signature to the device it doesn't do anything which would let ZFS easily identify it as a swap device. One solution would be to add a udev helper which detects ZFS volumes with the swap signature and calls zvol_prealloc() via an ioctl(). This should allow things to "just work" on most Linux distributions, though obviously it's Linux specific. We should probably do something a bit more generic and reliable and add a new create-time-only property for volumes. This wouldn't even necessarily prevent us from adding the udev hook in the future.
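As a sketch of the detection half of such a udev helper: mkswap places the 10-byte magic string SWAPSPACE2 at the very end of the first page of the device, so a helper can check for it with a single small read. The function name has_swap_signature is hypothetical, the 4096-byte page size is an assumption that would have to match the system's real page size, and the preallocation ioctl itself does not exist yet, so it is not shown.

```python
import os
import tempfile

SWAP_MAGIC = b"SWAPSPACE2"  # signature written by mkswap


def has_swap_signature(path, page_size=4096):
    """Return True if the last 10 bytes of the first page hold the swap magic."""
    offset = page_size - len(SWAP_MAGIC)
    with open(path, "rb") as dev:
        dev.seek(offset)
        return dev.read(len(SWAP_MAGIC)) == SWAP_MAGIC


# Demo on regular files standing in for zvol device nodes.
with tempfile.TemporaryDirectory() as tmp:
    fake_swap = os.path.join(tmp, "fake-swap-zvol")
    with open(fake_swap, "wb") as f:
        f.write(b"\0" * (4096 - len(SWAP_MAGIC)) + SWAP_MAGIC)

    blank = os.path.join(tmp, "blank-zvol")
    with open(blank, "wb") as f:
        f.write(b"\0" * 4096)

    found_on_fake = has_swap_signature(fake_swap)
    found_on_blank = has_swap_signature(blank)

print(found_on_fake, found_on_blank)
```

A signature check like this only says that mkswap was run at some point, which is one reason an explicit create-time property is the more reliable option.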

prakashsurya · 2018-09-17T21:15:16Z

@behlendorf that's unfortunate; given there's no good entry point to do the preallocation, your two suggestions seem reasonable at first glance.

Using a new ioctl (or create-time property) to preallocate and "convert" a ZVOL to a "direct write" ZVOL might be a neat feature, as it could enable this use case of swap, but also other use cases where the user would like to forego the COW semantics (e.g. a zpool layered on top of ZVOLs).

davidklaftenegger · 2018-12-07T22:22:03Z

Maybe this wiki page

https://github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-use-a-zvol-as-a-swap-device

should be edited to warn about this issue?

MobyGamer · 2019-01-18T15:13:31Z

I took the liberty of amending the wiki (I've been affected by this as well). It's quite reproducible in v0.6.5.9-5~bpo8+1 running on 3.16.0-7-amd64 (Debian 3.16.59-1), so I look forward to an eventual fix. I've found no sustainable workaround (other than to stop using zvols for swap).

mafredri · 2019-01-18T18:47:27Z

Thanks @MobyGamer. I'm running ZFS 0.7.12-1~bpo9+1 on Debian Stretch (kernel 4.19.12-1~bpo9+1) on a system with 2GB of RAM and I experienced this issue. I had followed the Debian Stretch Root on ZFS guide and was led to believe I should use swap on ZFS. Searching for the cause, I found really nothing in the wikis about potential downsides to doing it, and the only "against" I managed to find was this issue. It's good that there is at least some documentation regarding it now 😄.

gmelikov · 2019-01-21T09:52:59Z

@mafredri I updated the wiki. Indeed, we hadn't found problems with swap on the ZFS 0.7 branch for a while.

didrocks · 2019-10-31T09:58:25Z

FYI, this issue is what made us switch from a swap zvol to a dedicated partition at the last minute in our installer for Ubuntu 19.10, after user feedback and our own experience. (https://bugs.launchpad.net/bugs/1847628)

This is quite an annoying issue: for ext4, we have been using a swapfile for quite a long time to avoid too much partitioning on users' machines, and we would like our default ZFS experience to be similar.

scineram · 2019-10-31T10:23:52Z

With multi_vdev_crash_dump it should be possible to swap into a preallocated zvol or file, the writes bypassing zfs completely, right? But it is not implemented yet.

prakashsurya · 2019-10-31T16:47:05Z

@didrocks I gave some information on a possible approach to address this long term, in my comment above. At this point, I think it's accepted that using a ZVOL for swap will result in instabilities, but we haven't had anybody with the time or motivation to step up and attempt to fix this.

I think at least partially, the lack of motivation stems from the fact that (I assume) most folks using ZFS on Linux are not using it for the root filesystem (I use it personally, and so does my employer, so I don't mean this as nobody uses root on ZFS). Now that Ubuntu is officially supporting a root filesystem on ZFS configuration, perhaps that will change. I mean that, both in terms of more users of a root filesystem on ZFS, but also in terms of more developers with sufficient motivation to fix this issue.

RSully · 2019-11-04T23:52:23Z

Alternatively, there is a nuclear option for resolving this. If we treat swap like Illumos treats dump devices by disabling CoW and checksums, things will definitely work. That would be at the expense of protection against bitrot on the swap.

and:

The drawback to this approach would be that we won't get the data consistency guarantees that we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).

Not that my 2¢ means much as a non-contributor, but I'd personally really prefer to see a design where at least checksums can be maintained functional so the possibility of end-to-end integrity can remain. Can anyone speak to the amount of complexity/effort that that single feature would add to this work?

I know Illumos dump devices were mentioned above. I am rather unfamiliar with Illumos, but I thought dump and swap devices were not the same thing, and that Illumos actually did use zvols for swap. Am I incorrect, or were the above comparisons not quite valid?

prakashsurya · 2019-11-05T00:29:58Z

I'd personally really prefer to see a design where at least checksums can be maintained functional

I think we'd all prefer this, but I don't think it's currently possible if we were to write directly to pre-allocated blocks; due to the checksums being stored in the indirect blocks, and the indirect blocks being written during the pre-allocation rather than when we write out the swap data.
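A toy illustration of why that is: the checksum of a data block lives in the pointer held by its parent, so overwriting the data in place without rewriting the parent leaves a stale checksum behind. Everything here is invented for the sketch (the BlockPointer class, the dictionary "disk", and SHA-256 as a stand-in checksum); it only models the relationship described above.

```python
import hashlib


def checksum(data):
    """Stand-in checksum; used here only because it is in the standard library."""
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Lives in the parent (indirect) block: child address plus child checksum."""

    def __init__(self, address, data):
        self.address = address
        self.checksum = checksum(data)


def verify(bp, disk):
    """Read the child block and compare it with the checksum kept in the parent."""
    return checksum(disk[bp.address]) == bp.checksum


# Preallocation: the data block is written (zero-filled) and the parent's
# pointer records the checksum of that initial content.
disk = {0: b"\0" * 4096}
bp = BlockPointer(0, disk[0])
ok_before = verify(bp, disk)

# Direct write of a swapped-out page: the data block changes in place,
# but the parent holding the checksum is not rewritten.
disk[0] = b"swapped-out page".ljust(4096, b"\0")
ok_after = verify(bp, disk)

print(ok_before, ok_after)
```

Keeping checksums valid would mean rewriting the parent on every swap write, which brings back the metadata updates the direct-write path is trying to avoid.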

and that Illumos actually did use zvols for swap.

Yes, we used ZVOLs for swap on our illumos based appliance, and we often saw (likely the same) deadlocks for workloads that tried to actually use the swap space.

makhomed · 2023-04-24T01:05:51Z

From my point of view, the current bug (#7734, Linux swap on zvol deadlock) is not documented properly. However, the discussion about updating the OpenZFS documentation is better continued in this dedicated issue, which is about documenting data corruption bugs and all other critical bugs in the different OpenZFS versions:

Describe in the OpenZFS documentation all known data corruption bugs and all other critical bugs and affected OpenZFS versions #11481

Make the section heading more generic (the section relates to ZFS files as well as ZFS volumes). Swapping to a ZFS volume is prone to deadlock. Remove the related instruction, direct readers to OpenZFS FAQ. Related, but not linked from within the manual page: <https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux> (Using a zvol for a swap device on Linux). <openzfs#7734> (Swap deadlock in 0.7.9). Singular, not plural, for non-supported swapping to a file. Pull-request: openzfs#14756 Signed-off-by: Graham Perrin <[email protected]>

Make the section heading more generic (the section relates to ZFS files as well as ZFS volumes). Swapping to a ZFS volume is prone to deadlock. Remove the related instruction, direct readers to OpenZFS FAQ. Related, but not linked from within the manual page: <https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux> (Using a zvol for a swap device on Linux). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Graham Perrin <[email protected]> Issue #7734 Closes #14756

Squashed commit of the following: commit 1e255365c9bf0e7858561d527c0ebdf8f90bc925 Author: Alexander Motin <[email protected]> Date: Tue Jun 27 20:03:37 2023 -0400 ZIL: Fix another use-after-free. lwb->lwb_issued_txg can not be accessed after lwb_state is set to LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be freed by zil_sync(). We must save the txg number before that. This is similar to the 55b1842f92, but as I see the bug is not new. It existed for quite a while, just was not triggered due to smaller race window. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14988 Closes #14999 commit 233893e7cb7a98895061100ef8363f0ac30204b5 Author: Alexander Motin <[email protected]> Date: Tue Jun 27 20:00:30 2023 -0400 Use big transactions for small recordsize writes. When ZFS appends files in chunks bigger than recordsize, it borrows buffer from ARC and fills it before opening transaction. This is supposed to help in case of page faults to not hold transaction open indefinitely. The problem appears when recordsize is set lower than default 128KB. Since each block is committed in separate transaction, per-transaction overhead becomes significant, and what is even worse, active use of per-dataset and per-pool locks to protect space use accounting for each transaction badly hurts the code SMP scalability. The same transaction size limitation applies in case of file rewrite, but without even excuse of buffer borrowing. To address the issue, disable the borrowing mechanism if recordsize is smaller than default and the write request is 4x bigger than it. In such case writes up to 32MB are executed in single transaction, that dramatically reduces overhead and lock contention.
Since the borrowing mechanism is not used for file rewrites, and it was never used by zvols, which seem to work fine, I don't think this change should create significant problems, partially because in addition to the borrowing mechanism there are also used pre-faults. My tests with 4/8 threads writing several files same time on datasets with 32KB recordsize in 1MB requests show reduction of CPU usage by the user threads by 25-35%. I would measure it in GB/s, but at that block size we are now limited by the lock contention of single write issue taskqueue, which is a separate problem we are going to work on. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14964 commit aea27422747921798a9b9e1b8e0f6230d5672ba5 Author: Laevos <[email protected]> Date: Tue Jun 27 16:58:32 2023 -0700 Remove unnecessary commas in zpool-create.8 Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Laevos <[email protected]> Closes #15011 commit 38a821c0d8f6bb51a866354e76078abf6a6ba1fc Author: Alexander Motin <[email protected]> Date: Tue Jun 27 12:09:48 2023 -0400 Another set of vdev queue optimizations. Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from time-sorted AVL-trees to simple lists. AVL-trees are too expensive for such a simple task. To change I/O priority without searching through the trees, add io_queue_state field to struct zio. To not check number of queued I/Os for each priority add vq_cqueued bitmap to struct vdev_queue. Update it when adding/removing I/Os. Make vq_cactive a separate array instead of struct vdev_queue_class member. Together those allow to avoid lots of cache misses when looking for work in vdev_queue_class_to_issue(). Introduce deadline of ~0.5s for LBA-sorted queues. 
Before this I saw some I/Os waiting in a queue for up to 8 seconds and possibly more due to starvation. With this change I no longer see it. I had to slightly more complicate the comparison function, but since it uses all the same cache lines the difference is minimal. For a sequential I/Os the new code in vdev_queue_io_to_issue() actually often uses more simple avl_first(), falling back to avl_find() and avl_nearest() only when needed. Arrange members in struct zio to access only one cache line when searching through vdev queues. While there, remove io_alloc_node, reusing the io_queue_node instead. Those two are never used same time. Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4 years since implemented, while still wasted time maintaining the offset-sorted tree of TRIM requests. Just remove the tree. Remove locking from txg_all_lists_empty(). It is racy by design, while 2 pair of locks/unlocks take noticeable time under the vdev queue lock. With these changes in my tests with volblocksize=4KB I measure vdev queue lock spin time reduction by 50% on read and 75% on write. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14925 commit 1737e75ab4e09a2d20e7cc64fa83dae047a302e9 Author: Rich Ercolani <[email protected]> Date: Mon Jun 26 16:57:12 2023 -0400 Add a delay to tearing down threads. It's been observed that in certain workloads (zvol-related being a big one), ZFS will end up spending a large amount of time spinning up taskqs only to tear them down again almost immediately, then spin them up again... I noticed this when I looked at what my mostly-idle system was doing and wondered how on earth taskq creation/destroy was a bunch of time... 
So I added a configurable delay to avoid it tearing down tasks the first time it notices them idle, and the total number of threads at steady state went up, but the amount of time being burned just tearing down/turning up new ones almost vanished. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14938 commit 68b8e2ffab23cba6ae87f18c59b044c833934f2f Author: Alexander Motin <[email protected]> Date: Sat Jun 17 22:51:37 2023 -0400 Fix memory leak in zil_parse(). 482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly leaking up to 128KB of memory per dataset during ZIL replay. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14987 commit ea0d03a8bd040e438bcaa43b8e449cbf717e14f3 Author: George Amanakis <[email protected]> Date: Thu Jun 15 21:45:36 2023 +0200 Shorten arcstat_quiescence sleep time With the latest L2ARC fixes, 2 seconds is too long to wait for quiescence of arcstats like l2_size. Shorten this interval to avoid having the persistent L2ARC tests in ZTS prematurely terminated. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14981 commit 3fa141285b8105b3cc11c1296b77ad6d24250f2c Author: Alexander Motin <[email protected]> Date: Thu Jun 15 13:49:03 2023 -0400 Remove ARC/ZIO physdone callbacks. Those callbacks were introduced many years ago as part of a bigger patch to smoothen the write throttling within a txg. They allow to account completion of individual physical writes within a logical one, improving cases when some of physical writes complete much sooner than others, gradually opening the write throttle. 
Few years after that ZFS got allocation throttling, working on a level of logical writes and limiting number of writes queued to vdevs at any point, and so limiting latency distribution between the physical writes and especially writes of multiple copies. The addition of scheduling deadline I proposed in #14925 should further reduce the latency distribution. Grown memory sizes over the past 10 years should also reduce importance of the smoothing. While the use of physdone callback may still in theory provide some smoother throttling, there are cases where we simply can not afford it. Since dirty data accounting is protected by pool-wide lock, in case of 6-wide RAIDZ, for example, it requires us to take it 8 times per logical block write, creating huge lock contention. My tests of this patch show radical reduction of the lock spinning time on workloads when smaller blocks are written to RAIDZ pools, when each of the disks receives 8-16KB chunks, but the total rate reaching 100K+ blocks per second. Same time attempts to measure any write time fluctuations didn't show anything noticeable. While there, remove also io_child_count/io_parent_count counters. They are used only for couple assertions that can be avoided. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14948 commit 9efc735904d194987f06870f355e08d94e39ab81 Author: Brian Behlendorf <[email protected]> Date: Wed Jun 14 10:04:05 2023 -0500 ZTS: Skip send_raw_ashift on FreeBSD On FreeBSD 14 this test runs slowly in the CI environment and is killed by the 10 minute timeout. Skip the test on FreeBSD until the slow down is resolved. Signed-off-by: Brian Behlendorf <[email protected]> Issue #14961 commit 9c54894bfc77f585806984f44c70a839543e6715 Author: Alexander Motin <[email protected]> Date: Wed Jun 14 11:02:27 2023 -0400 Switch refcount tracking from lists to AVL-trees. 
With large number of tracked references list searches under the lock become too expensive, creating enormous lock contention. On my tests with ZFS_DEBUG enabled this increases write throughput with 32KB blocks from ~1.2GB/s to ~7.5GB/s. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14970 commit 4e62540827a6ed15e08b2a627896d24bc661fa38 Author: George Amanakis <[email protected]> Date: Wed Jun 14 17:01:17 2023 +0200 Store the L2ARC device ashift in the vdev label If this is not done, and the pool has an ashift other than the default (at the moment 9) then the following happens: 1) vdev_alloc() assigns the ashift of the pool to L2ARC device, but upon export it is not stored anywhere 2) at the first import, vdev_open() sees an vdev_ashift() of 0 and assigns the logical_ashift, which is 9 3) reading the contents of L2ARC, including the header fails 4) L2ARC buffers are not restored in ARC. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14313 Closes #14963 commit adaa3e64ea46f21cc5f544228c48363977b7733e Author: George Amanakis <[email protected]> Date: Sat Jun 10 02:05:47 2023 +0200 Fix the L2ARC write size calculating logic (2) While commit bcd5321 adjusts the write size based on the size of the log block, this happens after comparing the unadjusted write size to the evicted (target) size. In this case l2ad_hand will exceed l2ad_evict and violate an assertion at the end of l2arc_write_buffers(). Fix this by adding the max log block size to the allocated size of the buffer to be committed before comparing the result to the target size. Also reset the l2arc_trim_ahead ZFS module variable when the adjusted write size exceeds the size of the L2ARC device. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14936 Closes #14954 commit 67118a7d6e74a6e818127096162478017610d13e Author: Andrew Innes <[email protected]> Date: Wed Jun 28 12:31:10 2023 +0800 Windows: Finally drop long disabled vdev cache. Signed-off-by: Andrew Innes <[email protected]> commit 5d80c98c28c931339138753a4e4c1156dbf951f4 Author: Alexander Motin <[email protected]> Date: Fri Jun 9 15:40:55 2023 -0400 Finally drop long disabled vdev cache. It was a vdev level read cache, designed to aggregate many small reads by speculatively issuing bigger reads instead and caching the result. But since it has almost no idea about what is going on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers, it was found to make more harm than good, for which reason it was disabled for the past 12 years. These days we have much better instruments to enlarge the I/Os, such as speculative and prescient prefetches, I/O scheduler, I/O aggregation etc. Besides just the dead code removal this removes one extra mutex lock/unlock per write inside vdev_cache_write(), not otherwise disabled and trying to do some work. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14953 commit 1f1ab33781b5736654b988e2e618ea79788fa1f7 Author: Brian Behlendorf <[email protected]> Date: Fri Jun 9 11:10:01 2023 -0700 ZTS: Skip checkpoint_discard_busy Until the ASSERT which is occasionally hit while running checkpoint_discard_busy is resolved skip this test case. Signed-off-by: Brian Behlendorf <[email protected]> Issue #12053 Closes #14952 commit b94049c2cbedbbe2af8e629bf974a6ed93f11acb Author: Alexander Motin <[email protected]> Date: Fri Jun 9 13:14:05 2023 -0400 Improve l2arc reporting in arc_summary. - Do not report L2ARC as FAULTED in presence of in-flight writes. - Report read and write I/Os, bytes and errors. 
- Remove few numbers not important to average user. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #12304 Closes #14946 commit 31044b5cfb6f91d376034c4d6374f61baaf03232 Author: Andrew Innes <[email protected]> Date: Wed Jun 28 12:00:39 2023 +0800 Windows: Use list_remove_head() where possible. Signed-off-by: Andrew Innes <[email protected]> commit 32eda54d0d75a94b6aa71dc80aa958095feb8011 Author: Alexander Motin <[email protected]> Date: Fri Jun 9 13:12:52 2023 -0400 Use list_remove_head() where possible. ... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14955 commit fe7693a3f87229d1ae93b5ce2bb84d8bb86a9f5c Author: Alexander Motin <[email protected]> Date: Fri Jun 9 13:08:05 2023 -0400 ZIL: Fix race introduced by f63811f0721. We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE state and dropping zl_lock, since it may be freed by zil_sync(). To free itxs and waiters after dropping the lock we need to move lwb_itxs and lwb_waiters lists elements to local storage. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14957 Closes #14959 commit 44c5a0c92f98e8c21221bd7051729d1947a10736 Author: Rich Ercolani <[email protected]> Date: Wed Jun 7 14:14:05 2023 -0400 Revert "systemd: Use non-absolute paths in Exec* lines" This reverts commit 79b20949b25c8db4d379f6486b0835a6613b480c since it doesn't work with the systemd version shipped with RHEL7-based systems. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14943 Closes #14945 commit ba5af00257eb4eb3363f297819a21c4da811392f Author: Brian Behlendorf <[email protected]> Date: Wed Jun 7 10:43:43 2023 -0700 Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926) When a kmem cache is exhausted and needs to be expanded a new slab is allocated. KM_SLEEP callers can block and wait for the allocation, but KM_NOSLEEP callers were incorrectly allowed to block as well. Resolve this by attempting an emergency allocation as a best effort. This may fail but that's fine since any KM_NOSLEEP consumer is required to handle an allocation failure. Signed-off-by: Brian Behlendorf <[email protected]> Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tony Hutter <[email protected]> commit d4ecd4efde1692641d1d0b89851e7a15e90632f8 Author: George Amanakis <[email protected]> Date: Tue Jun 6 21:32:37 2023 +0200 Fix the L2ARC write size calculating logic l2arc_write_size() should return the write size after adjusting for trim and overhead of the L2ARC log blocks. Also take into account the allocated size of log blocks when deciding when to stop writing buffers to L2ARC. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14939 commit 8692ab174e18faf444681d67d7ea4418600553cc Author: Rob Norris <[email protected]> Date: Wed Mar 15 18:18:10 2023 +1100 zdb: add -B option to generate backup stream This is more-or-less like `zfs send`, but specifying the snapshot by its objset id for situations where it can't be referenced any other way. Sponsored-By: Klara, Inc. 
Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: WHR <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14642 commit df84ca3f3bf9f265ebc76de17394df529fd07af6 Author: Andrew Innes <[email protected]> Date: Wed Jun 28 11:05:55 2023 +0800 Windows: znode: expose zfs_get_zplprop to libzpool Signed-off-by: Andrew Innes <[email protected]> commit 944c58247a13a92c9e4ffb2c0a9e6b6293dca37e Author: Rob Norris <[email protected]> Date: Sun Jun 4 11:14:20 2023 +1000 znode: expose zfs_get_zplprop to libzpool There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: WHR <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14642 commit 429f58cdbb195c8d50ed95c7309ee54d37526b70 Author: Alexander Motin <[email protected]> Date: Mon Jun 5 14:51:44 2023 -0400 Introduce zfs_refcount_(add|remove)_few(). There are two places where we need to add/remove several references with semantics of zfs_refcount_(add|remove). But when debug/tracing is disabled, it is a crime to run multiple atomic_inc() in a loop, especially under congested pool-wide allocator lock. Introduced new functions implement the same semantics as the loop, but without overhead in production builds. Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14934 commit 077c2f359feb69a13bee37ac4220d271d1c7bf27 Author: Brian Behlendorf <[email protected]> Date: Mon Jun 5 11:08:24 2023 -0700 Linux 6.3 compat: META (#14930) Update the META file to reflect compatibility with the 6.3 kernel. 
Signed-off-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> commit c2fcd6e484107fc7435087771757e88ba84f6093 Author: Graham Perrin <[email protected]> Date: Fri Jun 2 19:25:13 2023 +0100 zfs-create(8): ZFS for swap: caution, clarity Make the section heading more generic (the section relates to ZFS files as well as ZFS volumes). Swapping to a ZFS volume is prone to deadlock. Remove the related instruction, direct readers to OpenZFS FAQ. Related, but not linked from within the manual page: <https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux> (Using a zvol for a swap device on Linux). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Graham Perrin <[email protected]> Issue #7734 Closes #14756 commit 251dbe83e14085a26100aa894d79772cbb69dcda Author: Alexander Motin <[email protected]> Date: Fri Jun 2 14:01:58 2023 -0400 ZIL: Allow to replay blocks of any size. There seems to be no reason for ZIL blocks to be limited by 128KB other than replay code is written in such a way. This change does not increase the limit yet, just removes the artificial limitation. Avoided extra memcpy() may save us a second during replay. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. 
Closes #14910 commit 76170249d538965655dbd3206cd59566b1d3944b Author: Val Packett <[email protected]> Date: Thu May 11 18:16:57 2023 -0300 PAM: enable testing on FreeBSD Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit d1b68a45441cae8c399a8a3ed60b29726ed031ff Author: Val Packett <[email protected]> Date: Fri May 5 22:17:12 2023 -0300 PAM: support password changes even when not mounted There's usually no requirement that a user be logged in for changing their password, so let's not be surprising here. We need to use the fetch_lazy mechanism for the old password to avoid a double prompt for it, so that mechanism is now generalized a bit. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit 7424feff72f1e17ea27bcfe0d36cabce7c732eea Author: Val Packett <[email protected]> Date: Fri May 5 22:34:58 2023 -0300 PAM: add 'uid_min' and 'uid_max' options for changing the uid range Instead of a fixed >=1000 check, allow the configuration to override the minimum UID and add a maximum one as well. While here, add the uid range check to the authenticate method as well, and fix the return in the chauthtok method (seems very wrong to report success when we've done absolutely nothing). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit fc9e012f5fc7e7997acee2b6d8d759622b319f0e Author: Val Packett <[email protected]> Date: Fri May 5 22:02:13 2023 -0300 PAM: add 'forceunmount' flag Probably not always a good idea, but it's nice to have the option. It is a workaround for FreeBSD calling the PAM session end earlier than the last process is actually done touching the mount, for example.
Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit a39ed83bd31cc0c8c98dc3c4cc3d11b03d9af620 Author: Val Packett <[email protected]> Date: Fri May 5 19:35:57 2023 -0300 PAM: add 'recursive_homes' flag to use with 'prop_mountpoint' It's not always desirable to have a fixed flat homes directory. With the 'recursive_homes' flag, 'prop_mountpoint' search would traverse the whole tree starting at 'homes' (which can now be '*' to mean all pools) to find a dataset with a mountpoint matching the home directory. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit 7f8d5ef815b7559fcc671ff2add33ba9c2a74867 Author: Val Packett <[email protected]> Date: Fri May 5 21:56:39 2023 -0300 PAM: use boolean_t for config flags Since we already use boolean_t in the file, we can use it here. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit e2872932c85189f06a68f0ad10bd8eb6895d79c2 Author: Val Packett <[email protected]> Date: Fri May 5 20:00:48 2023 -0300 PAM: do not fail to mount if the key's already loaded If we're expecting a working home directory on login, it would be rather frustrating to not have it mounted just because it e.g. failed to unmount once on logout. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit b897137e2044c3ef6120820f753d940b7dfb58be Author: Rich Ercolani <[email protected]> Date: Wed May 31 19:58:41 2023 -0400 Revert "initramfs: use `mount.zfs` instead of `mount`" This broke mounting of snapshots on / for users. See https://github.com/openzfs/zfs/issues/9461#issuecomment-1376162949 for more context. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14908 commit 10cde4f8f60d4d55887d7122a5742e6e4f90280c Author: Luís Henriques <[email protected]> Date: Tue May 30 23:15:24 2023 +0100 Fix NULL pointer dereference when doing concurrent 'send' operations A NULL pointer will occur when doing a 'zfs send -S' on a dataset that is still being received. The problem is that the new 'send' will rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will fail), but then dmu_send() will still do the dsl_dataset_disown(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Luís Henriques <[email protected]> Closes #14903 Closes #14890 commit 12452d79a3fd29af1dc0b95f3e367e3ce339702b Author: Brian Behlendorf <[email protected]> Date: Mon May 29 12:55:35 2023 -0700 ZTS: zvol_misc_trim disable blk mq Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on impacted kernels. This issue is being actively worked in #14872 and as part of that fix this commit will be reverted. VERIFY(zh->zh_claim_txg == 0) failed PANIC at zil.c:904:zil_create() Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #14872 Closes #14870 commit 803c04f233e60a2d23f0463f299eba96c0968602 Author: Richard Yao <[email protected]> Date: Fri May 26 18:47:52 2023 -0400 Use __attribute__((malloc)) on memory allocation functions This informs the C compiler that pointers returned from these functions do not alias other functions, which allows it to do better code optimization and should make the compiled code smaller. 
References: https://stackoverflow.com/a/53654773 https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute https://clang.llvm.org/docs/AttributeReference.html#malloc Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14827 commit 64d8bbe15f77876ae9639b9971a743776a41bf9a Author: Brian Behlendorf <[email protected]> Date: Fri May 26 15:39:23 2023 -0700 ZTS: Add zpool_resilver_concurrent exception The zpool_resilver_concurrent test case requires the ZED which is not used on FreeBSD. Add this test to the known list of skipped tested for FreeBSD. Signed-off-by: Brian Behlendorf <[email protected]> Closes #14904 commit e396d30d29ed131194605222e6ba1fec1ef8b2ca Author: Mike Swanson <[email protected]> Date: Fri May 26 15:37:15 2023 -0700 Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2} Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Mike Swanson <[email protected]> Closes #14902 commit f6dd0b8c1cc41707d299b7123f80912f43d03340 Author: Colm <[email protected]> Date: Fri May 26 10:04:19 2023 -0700 Adding new read-only compatible zpool features to compatibility.d/grub2 GRUB2 is compatible with all "read-only compatible" features, so it is safe to add new features of this type to the grub2 compatibility list. We generally want to include all compatible features, to minimize the differences between grub2-compatible pools and no-compatibility pools. Adding new properties `livelist` and `zpool_checkpoint` accordingly. Also adding them to the man page which references this file as an example, for consistency. 
Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Colm Buckley <[email protected]> Closes #14893 commit 013d3a1e0e00d83dabe70837b23dab48c1bac592 Author: Richard Yao <[email protected]> Date: Fri May 26 13:03:12 2023 -0400 btree: Implement faster binary search algorithm This implements a binary search algorithm for B-Trees that reduces branching to the absolute minimum necessary for a binary search algorithm. It also enables the compiler to inline the comparator to ensure that the only slowdown when doing binary search is from waiting for memory accesses. Additionally, it instructs the compiler to unroll the loop, which gives an additional 40% improvement with Clang and 8% improvement with GCC. Consumers must opt into using the faster algorithm. At present, only B-Trees used inside kernel code have been modified to use the faster algorithm. Micro-benchmarks suggest that this can improve binary search performance by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when compiling with GCC 12.2.
Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14866 commit 1854df330aa57cda39f076e8ab11e17ca3697bb8 Author: George Amanakis <[email protected]> Date: Fri May 26 18:53:00 2023 +0200 Fix inconsistent definition of zfs_scrub_error_blocks_per_txg Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14894 commit 8735e6ac03742fcf43adde3ce127af698a32c53a Author: Damiano Albani <[email protected]> Date: Fri May 26 01:10:54 2023 +0200 Add missing files to Debian DKMS package Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Umer Saleem <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Damiano Albani <[email protected]> Closes #14887 Closes #14889 commit d439021bd05a5cc0bb271a5470abb67af2f7bcda Author: Brian Behlendorf <[email protected]> Date: Thu May 25 13:53:08 2023 -0700 Update compatibility.d files Add an openzfs-2.2 compatibility file for the next release. Edon-R support has been enabled for FreeBSD removing the need for different FreeBSD and Linux files. Symlinks for the -linux and -freebsd names are created for any scripts expecting that convention. Additionally, a symlink for ubunutu-22.04 was added. Signed-off-by: Brian Behlendorf <[email protected]> Closes #14833 commit da54d5f3f9576b958e3eadf4f4d8f68c91b3d6e4 Author: Alexander Motin <[email protected]> Date: Thu May 25 16:51:53 2023 -0400 zil: Add some more statistics. In addition to a number of actual log bytes written, account also a total written bytes including padding and total allocated bytes (bytes <= write <= alloc). It should allow to monitor zil traffic and space efficiency. Add dtrace probe for zil block size selection. Make zilstat report more information and fit it into less width. 
Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14863 commit faa4955023d089668bd6c564c195a933d1eac455 Author: Alexander Motin <[email protected]> Date: Thu May 25 12:48:43 2023 -0400 ZIL: Reduce scope of per-dataset zl_issuer_lock. Before this change ZIL copied all log data while holding the lock. It caused huge lock contention on workloads with many big parallel writes. This change splits the process into two parts: first, zil_lwb_assign() estimates the log space needed for all transactions, and zil_lwb_write_close() allocates blocks and zios while holding the lock, then, after the lock in dropped, zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues the I/Os. Also while there slightly reduce scope of zl_lock. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14841 commit f77b9f7ae83834ade1da21cfc16b8a273df3acfc Author: Dimitri John Ledkov <[email protected]> Date: Wed May 24 20:31:28 2023 +0100 systemd: Use non-absolute paths in Exec* lines Since systemd v239, Exec* binaries are resolved from PATH when they are not-absolute. Switch to this by default for ease of downstream maintenance. Many downstream distributions move individual binaries to locations that existing compile-time configurations cannot accommodate. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Dimitri John Ledkov <[email protected]> Closes #14880 commit 4bfb9d28cffd4dfeb4b91359b497d100f668bb34 Author: Akash B <[email protected]> Date: Thu May 25 00:58:09 2023 +0530 Fix concurrent resilvers initiated at same time For draid vdevs it was possible to initiate both the sequential and healing resilver at same time. This fixes the following two scenarios. 
1) There's a window where a sequential rebuild can be started via ZED even if a healing resilver has been scheduled. - This is fixed by adding additional check in spa_vdev_attach() for any scheduled resilver and return appropriate error code when a resilver is already in progress. 2) It was possible for zpool clear to start a healing resilver when it wasn't needed at all. This occurs because during a vdev_open() the device is presumed to be healthy not until the device is validated by vdev_validate() and it's set unavailable. However, by this point an async resilver will have already been requested if the DTL isn't empty. - This is fixed by cancelling the SPA_ASYNC_RESILVER request immediately at the end of vdev_reopen() when a resilver is unneeded. Finally, added a testcase in ZTS for verification. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Dipak Ghosh <[email protected]> Signed-off-by: Akash B <[email protected]> Closes #14881 Closes #14892 commit c9bb406d177a00aa1f0058d29aeb29e478223273 Author: youzhongyang <[email protected]> Date: Wed May 24 15:23:42 2023 -0400 Linux 6.4 compat: reclaimed_slab renamed to reclaimed Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Youzhong Yang <[email protected]> Closes #14891 commit 79e61a873b136f13fcf140beb925ceddc1f94767 Author: Brian Atkinson <[email protected]> Date: Fri May 19 16:05:53 2023 -0400 Hold db_mtx when updating db_state Commit 555ef90 did some general code refactoring for dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx was not held when update db->db_state in those code block. The rest of the dbuf code always holds the db_mtx when updating db_state. This is important because cv_wait() db_changed is used to check for db_state changes. Updating dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the db_mtx when updating db_state. 
Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Closes #14875 commit d7be0cdf93a568b6c9b4a4e15a88a5d88ebbb764 Author: Brian Behlendorf <[email protected]> Date: Fri May 19 13:05:09 2023 -0700 Probe vdevs before marking removed Before allowing the ZED to mark a vdev as REMOVED due to a hotplug event confirm that it is non-responsive with probe. Any device which can be successfully probed should be left ONLINE to prevent a healthy pool from being incorrectly SUSPENDED. This may occur for at least the following two scenarios. 1) Drive expansion (zpool online -e) in VMware environments. If, during the partition resize operation, a partition is removed and re-created then udev will send a removed event. 2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan) may result in a udev remove and add event being delivered. Finally, update the ZED to only kick in a spare when the removal was successful. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #14859 Closes #14861 commit 054bb22686045ea1499065a4456568f0c21d939b Author: Andrew Innes <[email protected]> Date: Tue Jun 27 09:20:56 2023 +0800 Windows: Teach zpool scrub to scrub only blocks in error log Signed-off-by: Andrew Innes <[email protected]> commit b61e89a3e68ae19819493183ff3d1fe7bf4ffe2b Author: George Amanakis <[email protected]> Date: Fri Dec 17 21:35:28 2021 +0100 Teach zpool scrub to scrub only blocks in error log Added a flag '-e' in zpool scrub to scrub only blocks in error log. A user can pause, resume and cancel the error scrub by passing additional command line arguments -p -s just like a regular scrub. This involves adding a new flag, creating new libzfs interfaces, a new ioctl, and the actual iteration and read-issuing logic. 
Error scrubbing is executed in multiple txg to make sure pool performance is not affected. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Co-authored-by: TulsiJain [email protected] Signed-off-by: George Amanakis <[email protected]> Closes #8995 Closes #12355 commit 61bfb3cb5dd792ec7ca0fbfca59b165f3ddbe1f5 Author: Brian Behlendorf <[email protected]> Date: Thu May 18 10:02:20 2023 -0700 Add the ability to uninitialize zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it. Co-authored-by: Rich Ercolani <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12451 Closes #14873 commit 855b62942d4ca5dab3d65b7000f9d284fd1560bb Author: Antonio Russo <[email protected]> Date: Mon May 15 17:11:33 2023 -0600 test-runner: pass kmemleak and kmsg to Cmd.run test-runner.py orchestrates all of the ZTS executions. The `Cmd` object manages these process, and its `run` method specifically invokes these possibly long-running processes, possibly retrying in the event of a timeout. Since its inception, memory leak detection using the kmemleak infrastructure [1], and kernel logging [2] have been added to this run mechanism. However, the callback to cull a process beyond its timeout threshold, `kill_cmd`, has evaded modernization by both of these changes. As a result, this function fails to properly invoke `run`, leading to an untrapped exception and unreported test failure. This patch extends `kill_cmd` to receive these kernel devices through the `options` parameter, and regularizes all the `.run` calls from `Cmd`, and its subclasses, to accept that parameter. 
[1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1 [2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba Reviewed-by: John Wren Kennedy <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #14849 commit 537939565123fd2afa097e9a56ee3efd28779e5f Author: Richard Yao <[email protected]> Date: Fri May 12 17:10:14 2023 -0400 Fix undefined behavior in spa_sync_props() 8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin complaining about "Improper use of negative value" in two places in spa_sync_props() because Coverity correctly inferred from `prop == ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name() and zpool_prop_get_type() use it an array index, which is undefined behavior. Assuming that the system does not panic from an attempt to read invalid memory, the case statement for ZPOOL_PROP_INVAL will ensure that only user properties will reach this code when prop is ZPOOL_PROP_INVAL, such that execution will continue safely. However, if we are unlucky enough to read invalid memory, then the system will panic. This issue predates the patch that caused coverity to begin complaining. Thankfully, our userland tools do not pass nonsense to us, so this bug should not be triggered unless a future userland tool attempts to set a property that we do not understand. Reported-by: Coverity (CID-1561129) Reported-by: Coverity (CID-1561130) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Amanakis <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14860 commit 02351b380f0430980bfb92e83d0800df104bd06a Author: Richard Yao <[email protected]> Date: Fri May 12 16:47:56 2023 -0400 Fix use after free regression in spa_remove_healed_errors() 6839ec6f1098c28ff7b772f1b31b832d05e6b567 placed code in spa_remove_healed_errors() that uses a pointer after the kmem_free() call that frees it. 
Reported-by: Coverity (CID-1562375) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Amanakis <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14860 commit e9b315ffb79ff6419694a2713fcd5fd448317904 Author: Andrew Innes <[email protected]> Date: Mon May 15 13:52:35 2023 +0800 Use python3 on windows commit 3346a5b78c2db15801ce54a70a323952fdf67fa5 Author: Jorgen Lundman <[email protected]> Date: Thu Jun 22 08:56:38 2023 +0900 zfs_write() ignores errors If files were advanced by zfs_freesp() we ignored any errors returned by it. Signed-off-by: Jorgen Lundman <[email protected]> commit cce49c08316bc6a5dff287f4fa15856e26d5b18a Author: Jorgen Lundman <[email protected]> Date: Thu Jun 22 08:55:55 2023 +0900 Correct Stream event path The Stream path events used the incorrect name "stream", now uses "file.txt:stream" as per ntfs. Signed-off-by: Jorgen Lundman <[email protected]> commit 0f83d31e288d789fb4e10a7e4b12e27887820498 Author: Jorgen Lundman <[email protected]> Date: Wed Jun 21 14:30:13 2023 +0900 Add stub for file_hard_link_information() Signed-off-by: Jorgen Lundman <[email protected]> commit 8d6db9490364e4d281546445571d2ca9d5abda22 Author: Jorgen Lundman <[email protected]> Date: Wed Jun 21 14:29:43 2023 +0900 Return correct FileID in dirlist Signed-off-by: Jorgen Lundman <[email protected]> commit 4c011397229e3c38259d6956458a4fd287dca72d Author: Andrew Innes <[email protected]> Date: Wed Jun 21 10:17:30 2023 +0800 Fix logic (#232) Signed-off-by: Andrew Innes <[email protected]> commit 467436b676ad897025b7ed90d8f033969da441cc Author: Andrew Innes <[email protected]> Date: Wed Jun 21 09:47:38 2023 +0800 Run winbtrfs tests by default (#231) Signed-off-by: Andrew Innes <[email protected]> commit 56eca2a5d116c66b10579f9cf6d5f271991c7e2e Author: Jorgen Lundman <[email protected]> Date: Wed Jun 21 09:54:00 2023 +0900 SetFilePositionInformation SetFileValidDataLengthInformation Signed-off-by: Jorgen Lundman <[email protected]> 
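Note on the use-after-free fix listed above (commit 02351b380f04, spa_remove_healed_errors()): the log only says that a pointer was used after the kmem_free() call that freed it. A minimal sketch of that bug class and one common way to avoid it follows. The structure and function names (err_entry, make_entry, consume_entry) are hypothetical and chosen for illustration only, and plain malloc/free stand in for the kernel allocator. This is not the actual ZFS change.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a heap-allocated log entry; not a ZFS type. */
struct err_entry {
    unsigned long long txg;
    char name[32];
};

/* Allocate and fill an entry (illustrative helper). */
static struct err_entry *
make_entry(unsigned long long txg, const char *name)
{
    struct err_entry *e = calloc(1, sizeof (*e));

    assert(e != NULL);
    e->txg = txg;
    /* calloc zeroed the buffer, so the copy stays NUL-terminated. */
    strncpy(e->name, name, sizeof (e->name) - 1);
    return (e);
}

/*
 * Buggy shape (shown only as a comment):
 *     free(e);
 *     return (e->txg);    // reads memory that was already released
 *
 * Safer shape: copy whatever is still needed, then free.
 */
static unsigned long long
consume_entry(struct err_entry *e)
{
    unsigned long long txg = e->txg;    /* save before freeing */

    free(e);
    return (txg);
}
```

Reordering so that the last use precedes the free, or copying the needed field into a local first, are the usual options; which one the real patch took is not stated in the log.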
commit b4fbbda470f27aee565dfa9bc0d68217b969339c Author: Andrew Innes <[email protected]> Date: Tue Jun 20 16:33:12 2023 +0800 Add sleep to tests (#230) Signed-off-by: Andrew Innes <[email protected]> commit 94f1f52807d1f8c0c2931e9e52b91f0ce5e488f4 Author: Jorgen Lundman <[email protected]> Date: Tue Jun 20 16:53:50 2023 +0900 CreateFile of newfile:newstream should create both In addition, many more stream fixes, illegal chars, and names Signed-off-by: Jorgen Lundman <[email protected]> commit 894d512880d39ecf40e841c6d7b73157dfe397e0 Author: Jorgen Lundman <[email protected]> Date: Tue Jun 20 08:41:37 2023 +0900 Windows streams should return parent file ID When asked for File ID of a stream, it should return the FileID of the parent file, which is two levels up. Signed-off-by: Jorgen Lundman <[email protected]> commit 0cc45d2154a2866b2f494c3790a57555c29e60c3 Author: Jorgen Lundman <[email protected]> Date: Tue Jun 20 08:32:44 2023 +0900 Support FILE_STANDARD_INFORMATION_EX Signed-off-by: Jorgen Lundman <[email protected]> commit a6edd02999d581db56f4a53567f4c5db11778f64 Author: Jorgen Lundman <[email protected]> Date: Mon Jun 19 10:36:13 2023 +0900 Add xattr compat code from upstream and adjust calls to new API calls. This adds xattr=sa support to Windows. Signed-off-by: Jorgen Lundman <[email protected]> commit 0e1476a3942990385d32c02403ebe2c815d567db Author: Jorgen Lundman <[email protected]> Date: Wed Jun 14 11:56:09 2023 +0900 Set EA can panic Signed-off-by: Jorgen Lundman <[email protected]> commit 4a1adef6b8c2851195d692a42d5718c9a1b03490 Author: Jorgen Lundman <[email protected]> Date: Wed Jun 14 09:49:57 2023 +0900 Incorrect MAXPATH used in delete entry Signed-off-by: Jorgen Lundman <[email protected]> commit 2c0d119e37cb3eed1acac90efa9fe0f8c173e0f0 Author: Jorgen Lundman <[email protected]> Date: Tue Jun 13 16:19:42 2023 +0900 Large changes fixing FS notify events Some incorrect behavior still, query name of a stream is wrong. 
Signed-off-by: Jorgen Lundman <[email protected]> commit 5b2b2b0550a493497a0b460206079fd57c639543 Author: Jorgen Lundman <[email protected]> Date: Tue May 16 14:42:52 2023 +0900 file name and file full information buffer overrun When a buffer is not big enough, we would still null terminate on the full string, beyond the supplied buffer. Signed-off-by: Jorgen Lundman <[email protected]> commit 94bfb92951a5ccdef7b2a1fb818fafdafbc4fff0 Author: Jorgen Lundman <[email protected]> Date: Tue May 16 11:48:12 2023 +0900 Correct Query EA and Query Streams Which includes: * NextEntryOffset is not offset from Buffer, but from one struct to the next struct. * Pack only complete EAs, and return Overflow if does not fit * query file EA information would return from Information=size * Call cleareaszie on VP when EAs have changed Signed-off-by: Jorgen Lundman <[email protected]> commit 9c7a4071fcfc99c3308620fc1943355f9ade34b3 Author: Alexander Motin <[email protected]> Date: Fri May 12 12:49:26 2023 -0400 zil: Free lwb_buf after write completion. There is no sense to keep that memory allocated during the flush. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14855 commit 7e91b3222ddaadc10c92d1065529886dd3806acc Author: Alexander Motin <[email protected]> Date: Fri May 12 12:14:29 2023 -0400 zil: Some micro-optimizations. Should not cause functional changes. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14854 commit 6b62c3b0e10de782c3aef0e1206aa48875519c4e Author: Don Brady <[email protected]> Date: Fri May 12 10:12:28 2023 -0600 Refine special_small_blocks property validation When the special_small_blocks property is being set during a pool create it enforces a limit of 128KiB even if the pool's record size is larger. 
If the recordsize property is being set during a pool create, then use that value instead of the default SPA_OLD_MAXBLOCKSIZE value. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #13815 Closes #14811 commit d0ab2dddde618c394fa7fe88211276786ba8ca12 Author: Brian Behlendorf <[email protected]> Date: Fri May 12 09:07:58 2023 -0700 ZTS: Add auto_replace_001_pos to exceptions The auto_replace_001_pos test case does not reliably pass on Fedora 37 and newer. Until the test case can be updated to make it reliable add it to the list of "maybe" exceptions on Linux. Signed-off-by: Brian Behlendorf <[email protected]> Issue #14851 Closes #14852 commit 1e3e7a103a5026e9a2005acec7017e4024d95115 Author: Pawel Jakub Dawidek <[email protected]> Date: Tue May 9 22:32:30 2023 -0700 Make sure we are not trying to clone a spill block. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit a22891c3272d8527d4c8cb7ff52a25ef396e7add Author: Pawel Jakub Dawidek <[email protected]> Date: Thu May 4 16:14:19 2023 -0700 Correct comment. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 9b016166dd5875db87963b5deeca8eeda094b571 Author: Pawel Jakub Dawidek <[email protected]> Date: Wed May 3 23:25:22 2023 -0700 Remove badly placed comment. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 6bcd48e213a279781ecd6df22799532cbec353d6 Author: Pawel Jakub Dawidek <[email protected]> Date: Wed May 3 00:24:47 2023 -0700 Don't call zfs_exit_two() before zfs_enter_two(). 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 0919c985e294a89169adacd5ed4a240945e5fbee Author: Pawel Jakub Dawidek <[email protected]> Date: Tue May 2 15:46:14 2023 -0700 Don't use dmu_buf_is_dirty() for unassigned transaction. The dmu_buf_is_dirty() call doesn't make sense here for two reasons: 1. txg is 0 for unassigned tx, so it was a no-op. 2. It is equivalent of checking if we have dirty records and we are doing this few lines earlier. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 7f88494ac91c61aeffad810e7d167badb875166e Author: Pawel Jakub Dawidek <[email protected]> Date: Tue May 2 14:24:43 2023 -0700 Deny block cloning is dbuf size doesn't match BP size. I don't know an easy way to shrink down dbuf size, so just deny block cloning into dbufs that don't match our BP's size. This fixes the following situation: 1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB. 2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB. 3. Truncate the large file to 0. Its dbuf will remain 2kB. 4. Clone the small file into the large file. Small file's BP lsize is 1kB, but the large file's dbuf is 2kB. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 49657002f9cb57b9b4675100aaf58e1e93984bbf Author: Pawel Jakub Dawidek <[email protected]> Date: Sun Apr 30 02:47:09 2023 -0700 Additional block cloning fixes. Reimplement some of the block cloning vs dbuf logic, mostly to fix situation where we clone a block and in the same transaction group we want to partially overwrite the clone. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 4d31369d3055bf0cf1d4f3e1e7d43d745f2fd05f Author: Alexander Motin <[email protected]> Date: Thu May 11 17:27:12 2023 -0400 zil: Don't expect zio_shrink() to succeed. At least for RAIDZ zio_shrink() does not reduce zio size, but reduced wsz in that case likely results in writing uninitialized memory. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14853 commit 663dc5f616e6d0427207ffcf7a83dd02fe06a707 Author: Ameer Hamza <[email protected]> Date: Wed May 10 05:56:35 2023 +0500 Prevent panic during concurrent snapshot rollback and zvol read Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent release of the dnode, avoiding panic when a snapshot is rolled back in parallel during ongoing zvol read operation. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14839 commit 7375f4f61ca587f893435184f398a767ae52fbea Author: Tony Hutter <[email protected]> Date: Tue May 9 17:55:19 2023 -0700 pam: Fix "buffer overflow" in pam ZTS tests on F38 The pam ZTS tests were reporting a buffer overflow on F38, possibly due to F38 now setting _FORTIFY_SOURCE=3 by default. gdb and valgrind narrowed this down to a snprintf() buffer overflow in zfs_key_config_modify_session_counter(). I'm not clear why this particular snprintf() was being flagged as an overflow, but when I replaced it with an asprintf(), the test passed reliably. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #14802 Closes #14842 commit 9d3ed831f309e28a9cad56c8b1520292dbad0d7b Author: Brian Behlendorf <[email protected]> Date: Tue May 9 09:03:10 2023 -0700 Add dmu_tx_hold_append() interface Provides an interface which callers can use to declare a write when the exact starting offset in not yet known. Since the full range being updated is not available only the first L0 block at the provided offset will be prefetched. Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14819 commit 2b6033d71da38015c885297d1ee6577871099744 Author: Brian Behlendorf <[email protected]> Date: Tue May 9 08:57:02 2023 -0700 Debug auto_replace_001_pos failures Reduced the timeout to 60 seconds which should be more than sufficient and allow the test to be marked as FAILED rather than KILLED. Also dump the pool status on cleanup. Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14829 commit f4adc2882fb162c82e9738c5d2d30e3ba8a66367 Author: George Amanakis <[email protected]> Date: Tue May 9 17:54:41 2023 +0200 Remove duplicate code in l2arc_evict() l2arc_evict() performs the adjustment of the size of buffers to be written on L2ARC unnecessarily. l2arc_write_size() is called right before l2arc_evict() and performs those adjustments. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14828 commit 9b2c182d291bbb3ece9ceb1c72800d238d19b2e7 Author: Alexander Motin <[email protected]> Date: Tue May 9 11:54:01 2023 -0400 Remove single parent assertion from zio_nowait(). We only need to know if ZIO has any parent there. We do not care if it has more than one, but use of zio_unique_parent() == NULL asserts that. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14823 commit 4def61804c052a1235179e3a7c98305d8075e0e9 Author: George Amanakis <[email protected]> Date: Tue May 9 17:53:27 2023 +0200 Enable the head_errlog feature to remove errors In case check_filesystem() does not error out and does not report an error, remove that error block from error lists and logs without requiring a scrub. This can happen when the original file and all snapshots/clones referencing it have been removed. Otherwise zpool status will still report that "Permanent errors have been detected..." without actually reporting any of them. To implement this change the functions introduced in corrective receive were modified to take into account the head_errlog feature. Before this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: Permanent errors have been detected in the following files: ============================= After this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. 
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: No known data errors ============================= Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14813 commit 3f2f9533ca8512ef515a73ac5661598a65b896b6 Author: George Amanakis <[email protected]> Date: Mon May 8 22:35:03 2023 +0200 Fixes in head_errlog feature with encryption For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of dsl_dataset_hold_obj() in order to enable access to the encryption keys (if loaded). This enables reporting of errors in encrypted filesystems which are not mounted but have their keys loaded. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14837 commit 288ea63effae3ba24fcb6dc412a3125b9f3e1da9 Author: Matthew Ahrens <[email protected]> Date: Mon May 8 11:20:23 2023 -0700 Verify block pointers before writing them out If a block pointer is corrupted (but the block containing it checksums correctly, e.g. due to a bug that overwrites random memory), we can often detect it before the block is read, with the `zfs_blkptr_verify()` function, which is used in `arc_read()`, `zio_free()`, etc. However, such corruption is not typically recoverable. To recover from it we would need to detect the memory error before the block pointer is written to disk. This PR verifies BP's that are contained in indirect blocks and dnodes before they are written to disk, in `dbuf_write_ready()`. This way, we'll get a panic before the on-disk data is corrupted. This will help us to diagnose what's causing the corruption, as well as being much easier to recover from. To minimize performance impact, only checks that can be done without holding the spa_config_lock are performed. 
Additionally, when corruption is detected, the raw words of the block pointer are logged. (Note that `dprintf_bp()` is a no-op by default, but if enabled it is not safe to use with invalid block pointers.) Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Zuchowski <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #14817 commit 23132688b9d54ef11413925f88c02d83d607ec2b Author: Brian Behlendorf <[email protected]> Date: Mon May 8 11:17:41 2023 -0700 zdb: consistent xattr output When using zdb to output the value of an xattr only interpret it as printable characters if the entire byte array is printable. Additionally, if the --parseable option is set always output the buffer contents as octal for easy parsing. Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14830 commit 6deb342248e10af92e2d3fbb4e4b1221812188ff Author: Brian Behlendorf <[email protected]> Date: Mon May 8 10:09:30 2023 -0700 ZTS: add snapshot/snapshot_002_pos exception Add snapshot_002_pos to the known list of occasional failures for FreeBSD until it can be made entirely reliable. Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #14831 Closes #14832 commit a0a125bab291fe005d29be5375a5bb2a1c8261c7 Author: Alexander Motin <[email protected]> Date: Fri May 5 12:17:55 2023 -0400 Fix two abd_gang_add_gang() issues. - There is no reason to assert that added gang is not empty. It may be weird to add an empty gang, but it is legal. - When moving chain list from the added gang clear its size, or it will trigger assertion in abd_verify() when that gang is freed. Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. 
Closes #14816 commit aefb80389458dcccdcb9659914714264248b8e52 Author: Pawel Jakub Dawidek <[email protected]> Date: Sat May 6 01:09:12 2023 +0900 Simplify and optimize random_int_between(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14805 commit cf53b4376d902baecc04e450038d49c84c848e56 Author: Pawel Jakub Dawidek <[email protected]> Date: Sat May 6 00:51:41 2023 +0900 Plug memory leak in zfsdev_state. On kernel module unload, free all zfsdev state structures, except for zfsdev_state_listhead, which is statically allocated. Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14824 commit 409f6b6fa0caba14be1995bbe28ca70e55ab7666 Author: Ameer Hamza <[email protected]> Date: Thu May 4 03:10:32 2023 +0500 zpool import -m also removing spare and cache when log device is missing spa_import() relies on a pool config fetched by spa_try_import() for spare/cache devices. Import flags are not passed to spa_tryimport(), which makes it return early due to a missing log device and missing retrieving the cache dev…
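
Note on the "Plug memory leak in zfsdev_state" entry above (commit cf53b4376d90): the message says that on module unload every state structure is freed except the list head, which is statically allocated. A minimal userland sketch of that idea follows, assuming a singly linked list. The names state_node, list_head, state_add and state_free_all are hypothetical, and free() stands in for the kernel allocator. It is not the code from the commit.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node; the real zfsdev_state layout is not shown in the log. */
struct state_node {
    struct state_node *next;
    int minor;
};

/* Statically allocated head: it must never be passed to free(). */
static struct state_node list_head = { NULL, 0 };

/* Append a heap-allocated node after the current tail. */
static struct state_node *
state_add(int minor)
{
    struct state_node *n = calloc(1, sizeof (*n));
    struct state_node *tail = &list_head;

    assert(n != NULL);
    n->minor = minor;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = n;
    return (n);
}

/* Free every node except the static head; returns how many were freed. */
static int
state_free_all(void)
{
    struct state_node *n = list_head.next;
    int freed = 0;

    while (n != NULL) {
        struct state_node *next = n->next;  /* read the link before freeing */

        free(n);
        freed++;
        n = next;
    }
    list_head.next = NULL;
    return (freed);
}
```

Starting the walk at list_head.next rather than at the head itself is what keeps the static element out of free().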

commit 9cde9c07739f76a37d729d3a323f49f5d4bc100f Author: Andrew Innes <[email protected]> Date: Wed Jun 28 19:27:10 2023 +0800 Revert various glitches Signed-off-by: Andrew Innes <[email protected]> commit d0c8c0fb05088bb016bc208d5f8cb709195cff87 Author: Andrew Innes <[email protected]> Date: Thu Jun 29 08:24:13 2023 +0800 Windows: znode: expose zfs_get_zplprop to libzpool Signed-off-by: Andrew Innes <[email protected]> commit 3d747f29b2864b661223d09bc8375d34e2105825 Author: Richard Yao <[email protected]> Date: Sun Dec 4 17:42:43 2022 -0500 Fix TOCTOU race in zpool_do_labelclear() Coverity reported a TOCTOU race in `zpool_do_labelclear()`. This is not believed to be a real security issue, but fixing it reduces the number of syscalls we do and will prevent other static analyzers from complaining about this. The code is expected to be equivalent. However, under rare circumstances, such as ELOOP, ENAMETOOLONG, ENOMEM, ENOTDIR and EOVERFLOW, we will display the error message that we currently display for the `open()` syscall rather than the one that we currently display for the `stat()` syscall. This is considered to be an improvement. Reported-by: Coverity (CID-1524188) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14575 commit 1e255365c9bf0e7858561d527c0ebdf8f90bc925 Author: Alexander Motin <[email protected]> Date: Tue Jun 27 20:03:37 2023 -0400 ZIL: Fix another use-after-free. lwb->lwb_issued_txg can not be accessed after lwb_state is set to LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be freed by zil_sync(). We must save the txg number before that. This is similar to the 55b1842f92, but as I see the bug is not new. It existed for quite a while, just was not triggered due to smaller race window. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. 
Closes #14988 Closes #14999 commit 233893e7cb7a98895061100ef8363f0ac30204b5 Author: Alexander Motin <[email protected]> Date: Tue Jun 27 20:00:30 2023 -0400 Use big transactions for small recordsize writes. When ZFS appends files in chunks bigger than recordsize, it borrows buffer from ARC and fills it before opening transaction. This supposed to help in case of page faults to not hold transaction open indefinitely. The problem appears when recordsize is set lower than default 128KB. Since each block is committed in separate transaction, per-transaction overhead becomes significant, and what is even worse, active use of of per-dataset and per-pool locks to protect space use accounting for each transaction badly hurts the code SMP scalability. The same transaction size limitation applies in case of file rewrite, but without even excuse of buffer borrowing. To address the issue, disable the borrowing mechanism if recordsize is smaller than default and the write request is 4x bigger than it. In such case writes up to 32MB are executed in single transaction, that dramatically reduces overhead and lock contention. Since the borrowing mechanism is not used for file rewrites, and it was never used by zvols, which seem to work fine, I don't think this change should create significant problems, partially because in addition to the borrowing mechanism there are also used pre-faults. My tests with 4/8 threads writing several files same time on datasets with 32KB recordsize in 1MB requests show reduction of CPU usage by the user threads by 25-35%. I would measure it in GB/s, but at that block size we are now limited by the lock contention of single write issue taskqueue, which is a separate problem we are going to work on. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. 
Closes #14964 commit aea27422747921798a9b9e1b8e0f6230d5672ba5 Author: Laevos <[email protected]> Date: Tue Jun 27 16:58:32 2023 -0700 Remove unnecessary commas in zpool-create.8 Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Laevos <[email protected]> Closes #15011 commit 38a821c0d8f6bb51a866354e76078abf6a6ba1fc Author: Alexander Motin <[email protected]> Date: Tue Jun 27 12:09:48 2023 -0400 Another set of vdev queue optimizations. Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from time-sorted AVL-trees to simple lists. AVL-trees are too expensive for such a simple task. To change I/O priority without searching through the trees, add io_queue_state field to struct zio. To not check number of queued I/Os for each priority add vq_cqueued bitmap to struct vdev_queue. Update it when adding/removing I/Os. Make vq_cactive a separate array instead of struct vdev_queue_class member. Together those allow to avoid lots of cache misses when looking for work in vdev_queue_class_to_issue(). Introduce deadline of ~0.5s for LBA-sorted queues. Before this I saw some I/Os waiting in a queue for up to 8 seconds and possibly more due to starvation. With this change I no longer see it. I had to slightly more complicate the comparison function, but since it uses all the same cache lines the difference is minimal. For a sequential I/Os the new code in vdev_queue_io_to_issue() actually often uses more simple avl_first(), falling back to avl_find() and avl_nearest() only when needed. Arrange members in struct zio to access only one cache line when searching through vdev queues. While there, remove io_alloc_node, reusing the io_queue_node instead. Those two are never used same time. Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4 years since implemented, while still wasted time maintaining the offset-sorted tree of TRIM requests. Just remove the tree. Remove locking from txg_all_lists_empty(). 
It is racy by design, while 2 pair of locks/unlocks take noticeable time under the vdev queue lock. With these changes in my tests with volblocksize=4KB I measure vdev queue lock spin time reduction by 50% on read and 75% on write. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14925 commit 1737e75ab4e09a2d20e7cc64fa83dae047a302e9 Author: Rich Ercolani <[email protected]> Date: Mon Jun 26 16:57:12 2023 -0400 Add a delay to tearing down threads. It's been observed that in certain workloads (zvol-related being a big one), ZFS will end up spending a large amount of time spinning up taskqs only to tear them down again almost immediately, then spin them up again... I noticed this when I looked at what my mostly-idle system was doing and wondered how on earth taskq creation/destroy was a bunch of time... So I added a configurable delay to avoid it tearing down tasks the first time it notices them idle, and the total number of threads at steady state went up, but the amount of time being burned just tearing down/turning up new ones almost vanished. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14938 commit 68b8e2ffab23cba6ae87f18c59b044c833934f2f Author: Alexander Motin <[email protected]> Date: Sat Jun 17 22:51:37 2023 -0400 Fix memory leak in zil_parse(). 482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly leaking up to 128KB of memory per dataset during ZIL replay. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. 
Closes #14987 commit ea0d03a8bd040e438bcaa43b8e449cbf717e14f3 Author: George Amanakis <[email protected]> Date: Thu Jun 15 21:45:36 2023 +0200 Shorten arcstat_quiescence sleep time With the latest L2ARC fixes, 2 seconds is too long to wait for quiescence of arcstats like l2_size. Shorten this interval to avoid having the persistent L2ARC tests in ZTS prematurely terminated. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14981 commit 3fa141285b8105b3cc11c1296b77ad6d24250f2c Author: Alexander Motin <[email protected]> Date: Thu Jun 15 13:49:03 2023 -0400 Remove ARC/ZIO physdone callbacks. Those callbacks were introduced many years ago as part of a bigger patch to smoothen the write throttling within a txg. They allow to account completion of individual physical writes within a logical one, improving cases when some of physical writes complete much sooner than others, gradually opening the write throttle. Few years after that ZFS got allocation throttling, working on a level of logical writes and limiting number of writes queued to vdevs at any point, and so limiting latency distribution between the physical writes and especially writes of multiple copies. The addition of scheduling deadline I proposed in #14925 should further reduce the latency distribution. Grown memory sizes over the past 10 years should also reduce importance of the smoothing. While the use of physdone callback may still in theory provide some smoother throttling, there are cases where we simply can not afford it. Since dirty data accounting is protected by pool-wide lock, in case of 6-wide RAIDZ, for example, it requires us to take it 8 times per logical block write, creating huge lock contention. My tests of this patch show radical reduction of the lock spinning time on workloads when smaller blocks are written to RAIDZ pools, when each of the disks receives 8-16KB chunks, but the total rate reaching 100K+ blocks per second. 
Same time attempts to measure any write time fluctuations didn't show anything noticeable. While there, remove also io_child_count/io_parent_count counters. They are used only for couple assertions that can be avoided. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14948 commit 9efc735904d194987f06870f355e08d94e39ab81 Author: Brian Behlendorf <[email protected]> Date: Wed Jun 14 10:04:05 2023 -0500 ZTS: Skip send_raw_ashift on FreeBSD On FreeBSD 14 this test runs slowly in the CI environment and is killed by the 10 minute timeout. Skip the test on FreeBSD until the slow down is resolved. Signed-off-by: Brian Behlendorf <[email protected]> Issue #14961 commit 9c54894bfc77f585806984f44c70a839543e6715 Author: Alexander Motin <[email protected]> Date: Wed Jun 14 11:02:27 2023 -0400 Switch refcount tracking from lists to AVL-trees. With large number of tracked references list searches under the lock become too expensive, creating enormous lock contention. On my tests with ZFS_DEBUG enabled this increases write throughput with 32KB blocks from ~1.2GB/s to ~7.5GB/s. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14970 commit 4e62540827a6ed15e08b2a627896d24bc661fa38 Author: George Amanakis <[email protected]> Date: Wed Jun 14 17:01:17 2023 +0200 Store the L2ARC device ashift in the vdev label If this is not done, and the pool has an ashift other than the default (at the moment 9) then the following happens: 1) vdev_alloc() assigns the ashift of the pool to L2ARC device, but upon export it is not stored anywhere 2) at the first import, vdev_open() sees an vdev_ashift() of 0 and assigns the logical_ashift, which is 9 3) reading the contents of L2ARC, including the header fails 4) L2ARC buffers are not restored in ARC. 
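A rough illustration of why the ashift mismatch in steps 1) to 4) matters. This is a minimal sketch, not code from the patch: `aligned_size` is a hypothetical helper, and it assumes that on-disk sizes are rounded up to a multiple of the sector size 2^ashift. Under that assumption the same payload has a different footprint with ashift 9 than with ashift 12, so anything laid out with one value and read back with the other no longer lines up.

```c
#include <stdint.h>

/*
 * Hypothetical helper: round psize up to the next multiple of the
 * sector size implied by ashift (sector = 2^ashift bytes).
 */
static uint64_t
aligned_size(uint64_t psize, unsigned ashift)
{
    uint64_t sector = UINT64_C(1) << ashift;

    return (psize + sector - 1) & ~(sector - 1);
}
```

For example, a 1000-byte payload rounds up to 1024 bytes with ashift 9 and to 4096 bytes with ashift 12.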
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14313 Closes #14963 commit adaa3e64ea46f21cc5f544228c48363977b7733e Author: George Amanakis <[email protected]> Date: Sat Jun 10 02:05:47 2023 +0200 Fix the L2ARC write size calculating logic (2) While commit bcd5321 adjusts the write size based on the size of the log block, this happens after comparing the unadjusted write size to the evicted (target) size. In this case l2ad_hand will exceed l2ad_evict and violate an assertion at the end of l2arc_write_buffers(). Fix this by adding the max log block size to the allocated size of the buffer to be committed before comparing the result to the target size. Also reset the l2arc_trim_ahead ZFS module variable when the adjusted write size exceeds the size of the L2ARC device. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14936 Closes #14954 commit 67118a7d6e74a6e818127096162478017610d13e Author: Andrew Innes <[email protected]> Date: Wed Jun 28 12:31:10 2023 +0800 Windows: Finally drop long disabled vdev cache. Signed-off-by: Andrew Innes <[email protected]> commit 5d80c98c28c931339138753a4e4c1156dbf951f4 Author: Alexander Motin <[email protected]> Date: Fri Jun 9 15:40:55 2023 -0400 Finally drop long disabled vdev cache. It was a vdev level read cache, designed to aggregate many small reads by speculatively issuing bigger reads instead and caching the result. But since it has almost no idea about what is going on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers, it was found to make more harm than good, for which reason it was disabled for the past 12 years. These days we have much better instruments to enlarge the I/Os, such as speculative and prescient prefetches, I/O scheduler, I/O aggregation etc. 
Besides just the dead code removal this removes one extra mutex lock/unlock per write inside vdev_cache_write(), not otherwise disabled and trying to do some work. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14953 commit 1f1ab33781b5736654b988e2e618ea79788fa1f7 Author: Brian Behlendorf <[email protected]> Date: Fri Jun 9 11:10:01 2023 -0700 ZTS: Skip checkpoint_discard_busy Until the ASSERT which is occasionally hit while running checkpoint_discard_busy is resolved skip this test case. Signed-off-by: Brian Behlendorf <[email protected]> Issue #12053 Closes #14952 commit b94049c2cbedbbe2af8e629bf974a6ed93f11acb Author: Alexander Motin <[email protected]> Date: Fri Jun 9 13:14:05 2023 -0400 Improve l2arc reporting in arc_summary. - Do not report L2ARC as FAULTED in presence of in-flight writes. - Report read and write I/Os, bytes and errors. - Remove few numbers not important to average user. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #12304 Closes #14946 commit 31044b5cfb6f91d376034c4d6374f61baaf03232 Author: Andrew Innes <[email protected]> Date: Wed Jun 28 12:00:39 2023 +0800 Windows: Use list_remove_head() where possible. Signed-off-by: Andrew Innes <[email protected]> commit 32eda54d0d75a94b6aa71dc80aa958095feb8011 Author: Alexander Motin <[email protected]> Date: Fri Jun 9 13:12:52 2023 -0400 Use list_remove_head() where possible. ... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. 
Closes #14955 commit fe7693a3f87229d1ae93b5ce2bb84d8bb86a9f5c Author: Alexander Motin <[email protected]> Date: Fri Jun 9 13:08:05 2023 -0400 ZIL: Fix race introduced by f63811f0721. We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE state and dropping zl_lock, since it may be freed by zil_sync(). To free itxs and waiters after dropping the lock we need to move lwb_itxs and lwb_waiters lists elements to local storage. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14957 Closes #14959 commit 44c5a0c92f98e8c21221bd7051729d1947a10736 Author: Rich Ercolani <[email protected]> Date: Wed Jun 7 14:14:05 2023 -0400 Revert "systemd: Use non-absolute paths in Exec* lines" This reverts commit 79b20949b25c8db4d379f6486b0835a6613b480c since it doesn't work with the systemd version shipped with RHEL7-based systems. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14943 Closes #14945 commit ba5af00257eb4eb3363f297819a21c4da811392f Author: Brian Behlendorf <[email protected]> Date: Wed Jun 7 10:43:43 2023 -0700 Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926) When a kmem cache is exhausted and needs to be expanded a new slab is allocated. KM_SLEEP callers can block and wait for the allocation, but KM_NOSLEEP callers were incorrectly allowed to block as well. Resolve this by attempting an emergency allocation as a best effort. This may fail but that's fine since any KM_NOSLEEP consumer is required to handle an allocation failure. 
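The caller-side contract described above can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_cache`, `toy_cache_alloc_nosleep` and `queue_event` are hypothetical names, not SPL interfaces, and the sketch does not model the emergency allocation path. It only shows a non-sleeping allocator that fails fast and a consumer that handles the failure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cache with a fixed number of free objects. */
struct toy_cache {
    size_t free_objs;
    size_t obj_size;
};

/*
 * Non-sleeping allocation: return NULL immediately when the cache is
 * exhausted instead of blocking until more memory becomes available.
 */
static void *
toy_cache_alloc_nosleep(struct toy_cache *c)
{
    if (c->free_objs == 0)
        return NULL;
    c->free_objs--;
    return malloc(c->obj_size);
}

/* Consumer: must check for NULL and report failure to its own caller. */
static bool
queue_event(struct toy_cache *c)
{
    void *ev = toy_cache_alloc_nosleep(c);

    if (ev == NULL)
        return false;   /* caller may drop the event or retry later */
    /* A real consumer would fill in and hand off ev; freed here for brevity. */
    free(ev);
    return true;
}
```

With a cache holding a single free object, the first `queue_event()` call succeeds and the second returns `false` without blocking.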
Signed-off-by: Brian Behlendorf <[email protected]> Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Richard Yao <[email protected]> Reviewed-by: Tony Hutter <[email protected]> commit d4ecd4efde1692641d1d0b89851e7a15e90632f8 Author: George Amanakis <[email protected]> Date: Tue Jun 6 21:32:37 2023 +0200 Fix the L2ARC write size calculating logic l2arc_write_size() should return the write size after adjusting for trim and overhead of the L2ARC log blocks. Also take into account the allocated size of log blocks when deciding when to stop writing buffers to L2ARC. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14939 commit 8692ab174e18faf444681d67d7ea4418600553cc Author: Rob Norris <[email protected]> Date: Wed Mar 15 18:18:10 2023 +1100 zdb: add -B option to generate backup stream This is more-or-less like `zfs send`, but specifying the snapshot by its objset id for situations where it can't be referenced any other way. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: WHR <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14642 commit df84ca3f3bf9f265ebc76de17394df529fd07af6 Author: Andrew Innes <[email protected]> Date: Wed Jun 28 11:05:55 2023 +0800 Windows: znode: expose zfs_get_zplprop to libzpool Signed-off-by: Andrew Innes <[email protected]> commit 944c58247a13a92c9e4ffb2c0a9e6b6293dca37e Author: Rob Norris <[email protected]> Date: Sun Jun 4 11:14:20 2023 +1000 znode: expose zfs_get_zplprop to libzpool There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. 
Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: WHR <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #14642 commit 429f58cdbb195c8d50ed95c7309ee54d37526b70 Author: Alexander Motin <[email protected]> Date: Mon Jun 5 14:51:44 2023 -0400 Introduce zfs_refcount_(add|remove)_few(). There are two places where we need to add/remove several references with semantics of zfs_refcount_(add|remove). But when debug/tracing is disabled, it is a crime to run multiple atomic_inc() in a loop, especially under congested pool-wide allocator lock. Introduced new functions implement the same semantics as the loop, but without overhead in production builds. Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14934 commit 077c2f359feb69a13bee37ac4220d271d1c7bf27 Author: Brian Behlendorf <[email protected]> Date: Mon Jun 5 11:08:24 2023 -0700 Linux 6.3 compat: META (#14930) Update the META file to reflect compatibility with the 6.3 kernel. Signed-off-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> commit c2fcd6e484107fc7435087771757e88ba84f6093 Author: Graham Perrin <[email protected]> Date: Fri Jun 2 19:25:13 2023 +0100 zfs-create(8): ZFS for swap: caution, clarity Make the section heading more generic (the section relates to ZFS files as well as ZFS volumes). Swapping to a ZFS volume is prone to deadlock. Remove the related instruction, direct readers to OpenZFS FAQ. Related, but not linked from within the manual page: <https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux> (Using a zvol for a swap device on Linux). 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Graham Perrin <[email protected]> Issue #7734 Closes #14756 commit 251dbe83e14085a26100aa894d79772cbb69dcda Author: Alexander Motin <[email protected]> Date: Fri Jun 2 14:01:58 2023 -0400 ZIL: Allow to replay blocks of any size. There seems to be no reason for ZIL blocks to be limited by 128KB other than replay code is written in such a way. This change does not increase the limit yet, just removes the artificial limitation. Avoided extra memcpy() may save us a second during replay. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14910 commit 76170249d538965655dbd3206cd59566b1d3944b Author: Val Packett <[email protected]> Date: Thu May 11 18:16:57 2023 -0300 PAM: enable testing on FreeBSD Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit d1b68a45441cae8c399a8a3ed60b29726ed031ff Author: Val Packett <[email protected]> Date: Fri May 5 22:17:12 2023 -0300 PAM: support password changes even when not mounted There's usually no requirement that a user be logged in for changing their password, so let's not be surprising here. We need to use the fetch_lazy mechanism for the old password to avoid a double prompt for it, so that mechanism is now generalized a bit. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit 7424feff72f1e17ea27bcfe0d36cabce7c732eea Author: Val Packett <[email protected]> Date: Fri May 5 22:34:58 2023 -0300 PAM: add 'uid_min' and 'uid_max' options for changing the uid range Instead of a fixed >=1000 check, allow the configuration to override the minimum UID and add a maximum one as well. 
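A configurable range check of this kind might look like the sketch below. The struct and function names are hypothetical and are not taken from the PAM module; the limits used in the usage note are examples, not the module's defaults.

```c
#include <stdbool.h>
#include <sys/types.h>   /* uid_t */

/* Hypothetical configuration holding an inclusive UID range. */
struct uid_range {
    uid_t uid_min;
    uid_t uid_max;
};

/* Return true when uid lies inside the configured range (inclusive). */
static bool
uid_in_range(const struct uid_range *r, uid_t uid)
{
    return uid >= r->uid_min && uid <= r->uid_max;
}
```

With `uid_min = 1000` and `uid_max = 60000`, UID 999 is rejected, UIDs 1000 and 60000 are accepted, and UID 60001 is rejected.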
    While here, add the uid range check to the authenticate method as well, and fix the return in the chauthtok method (seems very wrong to report success when we've done absolutely nothing).

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Felix Dörre <[email protected]>
    Signed-off-by: Val Packett <[email protected]>
    Closes #14834

commit fc9e012f5fc7e7997acee2b6d8d759622b319f0e
Author: Val Packett <[email protected]>
Date:   Fri May 5 22:02:13 2023 -0300

    PAM: add 'forceunmount' flag

    Probably not always a good idea, but it's nice to have the option. It is a workaround for FreeBSD calling the PAM session end earlier than the last process is actually done touching the mount, for example.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Felix Dörre <[email protected]>
    Signed-off-by: Val Packett <[email protected]>
    Closes #14834

commit a39ed83bd31cc0c8c98dc3c4cc3d11b03d9af620
Author: Val Packett <[email protected]>
Date:   Fri May 5 19:35:57 2023 -0300

    PAM: add 'recursive_homes' flag to use with 'prop_mountpoint'

    It's not always desirable to have a fixed flat homes directory. With the 'recursive_homes' flag, 'prop_mountpoint' search would traverse the whole tree starting at 'homes' (which can now be '*' to mean all pools) to find a dataset with a mountpoint matching the home directory.

    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Felix Dörre <[email protected]>
    Signed-off-by: Val Packett <[email protected]>
    Closes #14834

commit 7f8d5ef815b7559fcc671ff2add33ba9c2a74867
Author: Val Packett <[email protected]>
Date:   Fri May 5 21:56:39 2023 -0300

    PAM: use boolean_t for config flags

    Since we already use boolean_t in the file, we can use it here.
Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit e2872932c85189f06a68f0ad10bd8eb6895d79c2 Author: Val Packett <[email protected]> Date: Fri May 5 20:00:48 2023 -0300 PAM: do not fail to mount if the key's already loaded If we're expecting a working home directory on login, it would be rather frustrating to not have it mounted just because it e.g. failed to unmount once on logout. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Felix Dörre <[email protected]> Signed-off-by: Val Packett <[email protected]> Closes #14834 commit b897137e2044c3ef6120820f753d940b7dfb58be Author: Rich Ercolani <[email protected]> Date: Wed May 31 19:58:41 2023 -0400 Revert "initramfs: use `mount.zfs` instead of `mount`" This broke mounting of snapshots on / for users. See https://github.com/openzfs/zfs/issues/9461#issuecomment-1376162949 for more context. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #14908 commit 10cde4f8f60d4d55887d7122a5742e6e4f90280c Author: Luís Henriques <[email protected]> Date: Tue May 30 23:15:24 2023 +0100 Fix NULL pointer dereference when doing concurrent 'send' operations A NULL pointer will occur when doing a 'zfs send -S' on a dataset that is still being received. The problem is that the new 'send' will rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will fail), but then dmu_send() will still do the dsl_dataset_disown(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Luís Henriques <[email protected]> Closes #14903 Closes #14890 commit 12452d79a3fd29af1dc0b95f3e367e3ce339702b Author: Brian Behlendorf <[email protected]> Date: Mon May 29 12:55:35 2023 -0700 ZTS: zvol_misc_trim disable blk mq Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on impacted kernels. 
    This issue is being actively worked on in #14872 and as part of that fix this commit will be reverted.

        VERIFY(zh->zh_claim_txg == 0) failed
        PANIC at zil.c:904:zil_create()

    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Issue #14872
    Closes #14870

commit 803c04f233e60a2d23f0463f299eba96c0968602
Author: Richard Yao <[email protected]>
Date:   Fri May 26 18:47:52 2023 -0400

    Use __attribute__((malloc)) on memory allocation functions

    This informs the C compiler that pointers returned from these functions do not alias other pointers, which allows it to do better code optimization and should make the compiled code smaller.

    References:
    https://stackoverflow.com/a/53654773
    https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute
    https://clang.llvm.org/docs/AttributeReference.html#malloc

    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Richard Yao <[email protected]>
    Closes #14827

commit 64d8bbe15f77876ae9639b9971a743776a41bf9a
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 26 15:39:23 2023 -0700

    ZTS: Add zpool_resilver_concurrent exception

    The zpool_resilver_concurrent test case requires the ZED which is not used on FreeBSD. Add this test to the known list of skipped tests for FreeBSD.
    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #14904

commit e396d30d29ed131194605222e6ba1fec1ef8b2ca
Author: Mike Swanson <[email protected]>
Date:   Fri May 26 15:37:15 2023 -0700

    Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}

    Reviewed-by: Richard Yao <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Mike Swanson <[email protected]>
    Closes #14902

commit f6dd0b8c1cc41707d299b7123f80912f43d03340
Author: Colm <[email protected]>
Date:   Fri May 26 10:04:19 2023 -0700

    Adding new read-only compatible zpool features to compatibility.d/grub2

    GRUB2 is compatible with all "read-only compatible" features, so it is safe to add new features of this type to the grub2 compatibility list. We generally want to include all compatible features, to minimize the differences between grub2-compatible pools and no-compatibility pools.

    Adding new properties `livelist` and `zpool_checkpoint` accordingly.

    Also adding them to the man page which references this file as an example, for consistency.

    Reviewed-by: Richard Yao <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Colm Buckley <[email protected]>
    Closes #14893

commit 013d3a1e0e00d83dabe70837b23dab48c1bac592
Author: Richard Yao <[email protected]>
Date:   Fri May 26 13:03:12 2023 -0400

    btree: Implement faster binary search algorithm

    This implements a binary search algorithm for B-Trees that reduces branching to the absolute minimum necessary for a binary search algorithm. It also enables the compiler to inline the comparator to ensure that the only slowdown when doing binary search is from waiting for memory accesses. Additionally, it instructs the compiler to unroll the loop, which gives an additional 40% improvement with Clang and 8% improvement with GCC.

    Consumers must opt into using the faster algorithm. At present, only B-Trees used inside kernel code have been modified to use the faster algorithm.
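For readers unfamiliar with the technique, a branch-reduced binary search can be sketched as follows. This is a generic illustration over a sorted `int` array, not the B-Tree code from the commit: `lower_bound_branchless` is a hypothetical name, the comparator is hard-coded rather than inlined through a macro, and no loop unrolling is requested. The point is that the loop body contains no data-dependent branch, only a conditional pointer advance that compilers can typically turn into a conditional move.

```c
#include <stddef.h>

/*
 * Return the index of the first element >= key in the sorted array
 * a[0..n), or n when every element is smaller than key.
 */
static size_t
lower_bound_branchless(const int *a, size_t n, int key)
{
    const int *base = a;
    size_t len = n;

    if (len == 0)
        return 0;
    while (len > 1) {
        size_t half = len / 2;

        /* Advance by half only when the probe element is too small. */
        base += (base[half - 1] < key) ? half : 0;
        len -= half;
    }
    /* One element left: step past it if it is still smaller than key. */
    return (size_t)(base - a) + (base[0] < key);
}
```

The loop always runs the same number of iterations for a given `n`, which is what makes it friendly to unrolling.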
    Micro-benchmarks suggest that this can improve binary search performance by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when compiling with GCC 12.2.

    Reviewed-by: Alexander Motin <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Richard Yao <[email protected]>
    Closes #14866

commit 1854df330aa57cda39f076e8ab11e17ca3697bb8
Author: George Amanakis <[email protected]>
Date:   Fri May 26 18:53:00 2023 +0200

    Fix inconsistent definition of zfs_scrub_error_blocks_per_txg

    Reviewed-by: Richard Yao <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: George Amanakis <[email protected]>
    Closes #14894

commit 8735e6ac03742fcf43adde3ce127af698a32c53a
Author: Damiano Albani <[email protected]>
Date:   Fri May 26 01:10:54 2023 +0200

    Add missing files to Debian DKMS package

    Reviewed-by: Tino Reichardt <[email protected]>
    Reviewed-by: Umer Saleem <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Damiano Albani <[email protected]>
    Closes #14887
    Closes #14889

commit d439021bd05a5cc0bb271a5470abb67af2f7bcda
Author: Brian Behlendorf <[email protected]>
Date:   Thu May 25 13:53:08 2023 -0700

    Update compatibility.d files

    Add an openzfs-2.2 compatibility file for the next release. Edon-R support has been enabled for FreeBSD, removing the need for different FreeBSD and Linux files. Symlinks for the -linux and -freebsd names are created for any scripts expecting that convention.

    Additionally, a symlink for ubuntu-22.04 was added.

    Signed-off-by: Brian Behlendorf <[email protected]>
    Closes #14833

commit da54d5f3f9576b958e3eadf4f4d8f68c91b3d6e4
Author: Alexander Motin <[email protected]>
Date:   Thu May 25 16:51:53 2023 -0400

    zil: Add some more statistics.

    In addition to the number of actual log bytes written, also account for the total bytes written including padding and the total bytes allocated (bytes <= write <= alloc). This should make it possible to monitor zil traffic and space efficiency.
    Add dtrace probe for zil block size selection.

    Make zilstat report more information and fit it into less width.

    Reviewed-by: Ameer Hamza <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14863

commit faa4955023d089668bd6c564c195a933d1eac455
Author: Alexander Motin <[email protected]>
Date:   Thu May 25 12:48:43 2023 -0400

    ZIL: Reduce scope of per-dataset zl_issuer_lock.

    Before this change ZIL copied all log data while holding the lock. It caused huge lock contention on workloads with many big parallel writes. This change splits the process into two parts: first, zil_lwb_assign() estimates the log space needed for all transactions, and zil_lwb_write_close() allocates blocks and zios while holding the lock; then, after the lock is dropped, zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues the I/Os.

    Also while there slightly reduce scope of zl_lock.

    Reviewed-by: Paul Dagnelie <[email protected]>
    Reviewed-by: Prakash Surya <[email protected]>
    Reviewed-by: Richard Yao <[email protected]>
    Signed-off-by: Alexander Motin <[email protected]>
    Sponsored by: iXsystems, Inc.
    Closes #14841

commit f77b9f7ae83834ade1da21cfc16b8a273df3acfc
Author: Dimitri John Ledkov <[email protected]>
Date:   Wed May 24 20:31:28 2023 +0100

    systemd: Use non-absolute paths in Exec* lines

    Since systemd v239, Exec* binaries are resolved from PATH when they are not absolute. Switch to this by default for ease of downstream maintenance. Many downstream distributions move individual binaries to locations that existing compile-time configurations cannot accommodate.
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Dimitri John Ledkov <[email protected]> Closes #14880 commit 4bfb9d28cffd4dfeb4b91359b497d100f668bb34 Author: Akash B <[email protected]> Date: Thu May 25 00:58:09 2023 +0530 Fix concurrent resilvers initiated at same time For draid vdevs it was possible to initiate both the sequential and healing resilver at same time. This fixes the following two scenarios. 1) There's a window where a sequential rebuild can be started via ZED even if a healing resilver has been scheduled. - This is fixed by adding additional check in spa_vdev_attach() for any scheduled resilver and return appropriate error code when a resilver is already in progress. 2) It was possible for zpool clear to start a healing resilver when it wasn't needed at all. This occurs because during a vdev_open() the device is presumed to be healthy not until the device is validated by vdev_validate() and it's set unavailable. However, by this point an async resilver will have already been requested if the DTL isn't empty. - This is fixed by cancelling the SPA_ASYNC_RESILVER request immediately at the end of vdev_reopen() when a resilver is unneeded. Finally, added a testcase in ZTS for verification. 
    Reviewed-by: Brian Behlendorf <[email protected]>
    Reviewed-by: Dipak Ghosh <[email protected]>
    Signed-off-by: Akash B <[email protected]>
    Closes #14881
    Closes #14892

commit c9bb406d177a00aa1f0058d29aeb29e478223273
Author: youzhongyang <[email protected]>
Date:   Wed May 24 15:23:42 2023 -0400

    Linux 6.4 compat: reclaimed_slab renamed to reclaimed

    Reviewed-by: Richard Yao <[email protected]>
    Reviewed-by: Brian Atkinson <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Youzhong Yang <[email protected]>
    Closes #14891

commit 79e61a873b136f13fcf140beb925ceddc1f94767
Author: Brian Atkinson <[email protected]>
Date:   Fri May 19 16:05:53 2023 -0400

    Hold db_mtx when updating db_state

    Commit 555ef90 did some general code refactoring for dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx was not held when updating db->db_state in those code blocks. The rest of the dbuf code always holds the db_mtx when updating db_state. This is important because cv_wait() db_changed is used to check for db_state changes.

    Updating dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the db_mtx when updating db_state.

    Reviewed-by: Alexander Motin <[email protected]>
    Reviewed-by: Brian Behlendorf <[email protected]>
    Signed-off-by: Brian Atkinson <[email protected]>
    Closes #14875

commit d7be0cdf93a568b6c9b4a4e15a88a5d88ebbb764
Author: Brian Behlendorf <[email protected]>
Date:   Fri May 19 13:05:09 2023 -0700

    Probe vdevs before marking removed

    Before allowing the ZED to mark a vdev as REMOVED due to a hotplug event, confirm that it is non-responsive with a probe. Any device which can be successfully probed should be left ONLINE to prevent a healthy pool from being incorrectly SUSPENDED. This may occur for at least the following two scenarios.

    1) Drive expansion (zpool online -e) in VMware environments. If, during the partition resize operation, a partition is removed and re-created then udev will send a removed event.
2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan) may result in a udev remove and add event being delivered. Finally, update the ZED to only kick in a spare when the removal was successful. Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Richard Yao <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #14859 Closes #14861 commit 054bb22686045ea1499065a4456568f0c21d939b Author: Andrew Innes <[email protected]> Date: Tue Jun 27 09:20:56 2023 +0800 Windows: Teach zpool scrub to scrub only blocks in error log Signed-off-by: Andrew Innes <[email protected]> commit b61e89a3e68ae19819493183ff3d1fe7bf4ffe2b Author: George Amanakis <[email protected]> Date: Fri Dec 17 21:35:28 2021 +0100 Teach zpool scrub to scrub only blocks in error log Added a flag '-e' in zpool scrub to scrub only blocks in error log. A user can pause, resume and cancel the error scrub by passing additional command line arguments -p -s just like a regular scrub. This involves adding a new flag, creating new libzfs interfaces, a new ioctl, and the actual iteration and read-issuing logic. Error scrubbing is executed in multiple txg to make sure pool performance is not affected. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Co-authored-by: TulsiJain [email protected] Signed-off-by: George Amanakis <[email protected]> Closes #8995 Closes #12355 commit 61bfb3cb5dd792ec7ca0fbfca59b165f3ddbe1f5 Author: Brian Behlendorf <[email protected]> Date: Thu May 18 10:02:20 2023 -0700 Add the ability to uninitialize zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it. 
Co-authored-by: Rich Ercolani <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12451 Closes #14873 commit 855b62942d4ca5dab3d65b7000f9d284fd1560bb Author: Antonio Russo <[email protected]> Date: Mon May 15 17:11:33 2023 -0600 test-runner: pass kmemleak and kmsg to Cmd.run test-runner.py orchestrates all of the ZTS executions. The `Cmd` object manages these process, and its `run` method specifically invokes these possibly long-running processes, possibly retrying in the event of a timeout. Since its inception, memory leak detection using the kmemleak infrastructure [1], and kernel logging [2] have been added to this run mechanism. However, the callback to cull a process beyond its timeout threshold, `kill_cmd`, has evaded modernization by both of these changes. As a result, this function fails to properly invoke `run`, leading to an untrapped exception and unreported test failure. This patch extends `kill_cmd` to receive these kernel devices through the `options` parameter, and regularizes all the `.run` calls from `Cmd`, and its subclasses, to accept that parameter. [1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1 [2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba Reviewed-by: John Wren Kennedy <[email protected]> Signed-off-by: Antonio Russo <[email protected]> Closes #14849 commit 537939565123fd2afa097e9a56ee3efd28779e5f Author: Richard Yao <[email protected]> Date: Fri May 12 17:10:14 2023 -0400 Fix undefined behavior in spa_sync_props() 8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin complaining about "Improper use of negative value" in two places in spa_sync_props() because Coverity correctly inferred from `prop == ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name() and zpool_prop_get_type() use it an array index, which is undefined behavior. 
Assuming that the system does not panic from an attempt to read invalid memory, the case statement for ZPOOL_PROP_INVAL will ensure that only user properties will reach this code when prop is ZPOOL_PROP_INVAL, such that execution will continue safely. However, if we are unlucky enough to read invalid memory, then the system will panic. This issue predates the patch that caused coverity to begin complaining. Thankfully, our userland tools do not pass nonsense to us, so this bug should not be triggered unless a future userland tool attempts to set a property that we do not understand. Reported-by: Coverity (CID-1561129) Reported-by: Coverity (CID-1561130) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Amanakis <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14860 commit 02351b380f0430980bfb92e83d0800df104bd06a Author: Richard Yao <[email protected]> Date: Fri May 12 16:47:56 2023 -0400 Fix use after free regression in spa_remove_healed_errors() 6839ec6f1098c28ff7b772f1b31b832d05e6b567 placed code in spa_remove_healed_errors() that uses a pointer after the kmem_free() call that frees it. Reported-by: Coverity (CID-1562375) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Amanakis <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #14860 commit e9b315ffb79ff6419694a2713fcd5fd448317904 Author: Andrew Innes <[email protected]> Date: Mon May 15 13:52:35 2023 +0800 Use python3 on windows commit 3346a5b78c2db15801ce54a70a323952fdf67fa5 Author: Jorgen Lundman <[email protected]> Date: Thu Jun 22 08:56:38 2023 +0900 zfs_write() ignores errors If files were advanced by zfs_freesp() we ignored any errors returned by it. 
Signed-off-by: Jorgen Lundman <[email protected]> commit cce49c08316bc6a5dff287f4fa15856e26d5b18a Author: Jorgen Lundman <[email protected]> Date: Thu Jun 22 08:55:55 2023 +0900 Correct Stream event path The Stream path events used the incorrect name "stream", now uses "file.txt:stream" as per ntfs. Signed-off-by: Jorgen Lundman <[email protected]> commit 0f83d31e288d789fb4e10a7e4b12e27887820498 Author: Jorgen Lundman <[email protected]> Date: Wed Jun 21 14:30:13 2023 +0900 Add stub for file_hard_link_information() Signed-off-by: Jorgen Lundman <[email protected]> commit 8d6db9490364e4d281546445571d2ca9d5abda22 Author: Jorgen Lundman <[email protected]> Date: Wed Jun 21 14:29:43 2023 +0900 Return correct FileID in dirlist Signed-off-by: Jorgen Lundman <[email protected]> commit 4c011397229e3c38259d6956458a4fd287dca72d Author: Andrew Innes <[email protected]> Date: Wed Jun 21 10:17:30 2023 +0800 Fix logic (#232) Signed-off-by: Andrew Innes <[email protected]> commit 467436b676ad897025b7ed90d8f033969da441cc Author: Andrew Innes <[email protected]> Date: Wed Jun 21 09:47:38 2023 +0800 Run winbtrfs tests by default (#231) Signed-off-by: Andrew Innes <[email protected]> commit 56eca2a5d116c66b10579f9cf6d5f271991c7e2e Author: Jorgen Lundman <[email protected]> Date: Wed Jun 21 09:54:00 2023 +0900 SetFilePositionInformation SetFileValidDataLengthInformation Signed-off-by: Jorgen Lundman <[email protected]> commit b4fbbda470f27aee565dfa9bc0d68217b969339c Author: Andrew Innes <[email protected]> Date: Tue Jun 20 16:33:12 2023 +0800 Add sleep to tests (#230) Signed-off-by: Andrew Innes <[email protected]> commit 94f1f52807d1f8c0c2931e9e52b91f0ce5e488f4 Author: Jorgen Lundman <[email protected]> Date: Tue Jun 20 16:53:50 2023 +0900 CreateFile of newfile:newstream should create both In addition, many more stream fixes, illegal chars, and names Signed-off-by: Jorgen Lundman <[email protected]> commit 894d512880d39ecf40e841c6d7b73157dfe397e0 Author: Jorgen Lundman <[email 
protected]> Date: Tue Jun 20 08:41:37 2023 +0900 Windows streams should return parent file ID When asked for File ID of a stream, it should return the FileID of the parent file, which is two levels up. Signed-off-by: Jorgen Lundman <[email protected]> commit 0cc45d2154a2866b2f494c3790a57555c29e60c3 Author: Jorgen Lundman <[email protected]> Date: Tue Jun 20 08:32:44 2023 +0900 Support FILE_STANDARD_INFORMATION_EX Signed-off-by: Jorgen Lundman <[email protected]> commit a6edd02999d581db56f4a53567f4c5db11778f64 Author: Jorgen Lundman <[email protected]> Date: Mon Jun 19 10:36:13 2023 +0900 Add xattr compat code from upstream and adjust calls to new API calls. This adds xattr=sa support to Windows. Signed-off-by: Jorgen Lundman <[email protected]> commit 0e1476a3942990385d32c02403ebe2c815d567db Author: Jorgen Lundman <[email protected]> Date: Wed Jun 14 11:56:09 2023 +0900 Set EA can panic Signed-off-by: Jorgen Lundman <[email protected]> commit 4a1adef6b8c2851195d692a42d5718c9a1b03490 Author: Jorgen Lundman <[email protected]> Date: Wed Jun 14 09:49:57 2023 +0900 Incorrect MAXPATH used in delete entry Signed-off-by: Jorgen Lundman <[email protected]> commit 2c0d119e37cb3eed1acac90efa9fe0f8c173e0f0 Author: Jorgen Lundman <[email protected]> Date: Tue Jun 13 16:19:42 2023 +0900 Large changes fixing FS notify events Some incorrect behavior still, query name of a stream is wrong. Signed-off-by: Jorgen Lundman <[email protected]> commit 5b2b2b0550a493497a0b460206079fd57c639543 Author: Jorgen Lundman <[email protected]> Date: Tue May 16 14:42:52 2023 +0900 file name and file full information buffer overrun When a buffer is not big enough, we would still null terminate on the full string, beyond the supplied buffer. 
Signed-off-by: Jorgen Lundman <[email protected]> commit 94bfb92951a5ccdef7b2a1fb818fafdafbc4fff0 Author: Jorgen Lundman <[email protected]> Date: Tue May 16 11:48:12 2023 +0900 Correct Query EA and Query Streams Which includes: * NextEntryOffset is not offset from Buffer, but from one struct to the next struct. * Pack only complete EAs, and return Overflow if does not fit * query file EA information would return from Information=size * Call cleareaszie on VP when EAs have changed Signed-off-by: Jorgen Lundman <[email protected]> commit 9c7a4071fcfc99c3308620fc1943355f9ade34b3 Author: Alexander Motin <[email protected]> Date: Fri May 12 12:49:26 2023 -0400 zil: Free lwb_buf after write completion. There is no sense to keep that memory allocated during the flush. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Prakash Surya <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14855 commit 7e91b3222ddaadc10c92d1065529886dd3806acc Author: Alexander Motin <[email protected]> Date: Fri May 12 12:14:29 2023 -0400 zil: Some micro-optimizations. Should not cause functional changes. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14854 commit 6b62c3b0e10de782c3aef0e1206aa48875519c4e Author: Don Brady <[email protected]> Date: Fri May 12 10:12:28 2023 -0600 Refine special_small_blocks property validation When the special_small_blocks property is being set during a pool create it enforces a limit of 128KiB even if the pool's record size is larger. If the recordsize property is being set during a pool create, then use that value instead of the default SPA_OLD_MAXBLOCKSIZE value. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #13815 Closes #14811 commit d0ab2dddde618c394fa7fe88211276786ba8ca12 Author: Brian Behlendorf <[email protected]> Date: Fri May 12 09:07:58 2023 -0700 ZTS: Add auto_replace_001_pos to exceptions The auto_replace_001_pos test case does not reliably pass on Fedora 37 and newer. Until the test case can be updated to make it reliable add it to the list of "maybe" exceptions on Linux. Signed-off-by: Brian Behlendorf <[email protected]> Issue #14851 Closes #14852 commit 1e3e7a103a5026e9a2005acec7017e4024d95115 Author: Pawel Jakub Dawidek <[email protected]> Date: Tue May 9 22:32:30 2023 -0700 Make sure we are not trying to clone a spill block. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit a22891c3272d8527d4c8cb7ff52a25ef396e7add Author: Pawel Jakub Dawidek <[email protected]> Date: Thu May 4 16:14:19 2023 -0700 Correct comment. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 9b016166dd5875db87963b5deeca8eeda094b571 Author: Pawel Jakub Dawidek <[email protected]> Date: Wed May 3 23:25:22 2023 -0700 Remove badly placed comment. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 6bcd48e213a279781ecd6df22799532cbec353d6 Author: Pawel Jakub Dawidek <[email protected]> Date: Wed May 3 00:24:47 2023 -0700 Don't call zfs_exit_two() before zfs_enter_two(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 0919c985e294a89169adacd5ed4a240945e5fbee Author: Pawel Jakub Dawidek <[email protected]> Date: Tue May 2 15:46:14 2023 -0700 Don't use dmu_buf_is_dirty() for unassigned transaction. The dmu_buf_is_dirty() call doesn't make sense here for two reasons: 1. 
txg is 0 for unassigned tx, so it was a no-op. 2. It is equivalent of checking if we have dirty records and we are doing this few lines earlier. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 7f88494ac91c61aeffad810e7d167badb875166e Author: Pawel Jakub Dawidek <[email protected]> Date: Tue May 2 14:24:43 2023 -0700 Deny block cloning is dbuf size doesn't match BP size. I don't know an easy way to shrink down dbuf size, so just deny block cloning into dbufs that don't match our BP's size. This fixes the following situation: 1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB. 2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB. 3. Truncate the large file to 0. Its dbuf will remain 2kB. 4. Clone the small file into the large file. Small file's BP lsize is 1kB, but the large file's dbuf is 2kB. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 49657002f9cb57b9b4675100aaf58e1e93984bbf Author: Pawel Jakub Dawidek <[email protected]> Date: Sun Apr 30 02:47:09 2023 -0700 Additional block cloning fixes. Reimplement some of the block cloning vs dbuf logic, mostly to fix situation where we clone a block and in the same transaction group we want to partially overwrite the clone. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Pawel Jakub Dawidek <[email protected]> Closes #14825 commit 4d31369d3055bf0cf1d4f3e1e7d43d745f2fd05f Author: Alexander Motin <[email protected]> Date: Thu May 11 17:27:12 2023 -0400 zil: Don't expect zio_shrink() to succeed. At least for RAIDZ zio_shrink() does not reduce zio size, but reduced wsz in that case likely results in writing uninitialized memory. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. 
Closes #14853 commit 663dc5f616e6d0427207ffcf7a83dd02fe06a707 Author: Ameer Hamza <[email protected]> Date: Wed May 10 05:56:35 2023 +0500 Prevent panic during concurrent snapshot rollback and zvol read Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent release of the dnode, avoiding panic when a snapshot is rolled back in parallel during ongoing zvol read operation. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes #14839 commit 7375f4f61ca587f893435184f398a767ae52fbea Author: Tony Hutter <[email protected]> Date: Tue May 9 17:55:19 2023 -0700 pam: Fix "buffer overflow" in pam ZTS tests on F38 The pam ZTS tests were reporting a buffer overflow on F38, possibly due to F38 now setting _FORTIFY_SOURCE=3 by default. gdb and valgrind narrowed this down to a snprintf() buffer overflow in zfs_key_config_modify_session_counter(). I'm not clear why this particular snprintf() was being flagged as an overflow, but when I replaced it with an asprintf(), the test passed reliably. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes #14802 Closes #14842 commit 9d3ed831f309e28a9cad56c8b1520292dbad0d7b Author: Brian Behlendorf <[email protected]> Date: Tue May 9 09:03:10 2023 -0700 Add dmu_tx_hold_append() interface Provides an interface which callers can use to declare a write when the exact starting offset in not yet known. Since the full range being updated is not available only the first L0 block at the provided offset will be prefetched. 
Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14819 commit 2b6033d71da38015c885297d1ee6577871099744 Author: Brian Behlendorf <[email protected]> Date: Tue May 9 08:57:02 2023 -0700 Debug auto_replace_001_pos failures Reduced the timeout to 60 seconds which should be more than sufficient and allow the test to be marked as FAILED rather than KILLED. Also dump the pool status on cleanup. Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14829 commit f4adc2882fb162c82e9738c5d2d30e3ba8a66367 Author: George Amanakis <[email protected]> Date: Tue May 9 17:54:41 2023 +0200 Remove duplicate code in l2arc_evict() l2arc_evict() performs the adjustment of the size of buffers to be written on L2ARC unnecessarily. l2arc_write_size() is called right before l2arc_evict() and performs those adjustments. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14828 commit 9b2c182d291bbb3ece9ceb1c72800d238d19b2e7 Author: Alexander Motin <[email protected]> Date: Tue May 9 11:54:01 2023 -0400 Remove single parent assertion from zio_nowait(). We only need to know if ZIO has any parent there. We do not care if it has more than one, but use of zio_unique_parent() == NULL asserts that. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #14823 commit 4def61804c052a1235179e3a7c98305d8075e0e9 Author: George Amanakis <[email protected]> Date: Tue May 9 17:53:27 2023 +0200 Enable the head_errlog feature to remove errors In case check_filesystem() does not error out and does not report an error, remove that error block from error lists and logs without requiring a scrub. This can happen when the original file and all snapshots/clones referencing it have been removed. 
Otherwise zpool status will still report that "Permanent errors have been detected..." without actually reporting any of them. To implement this change the functions introduced in corrective receive were modified to take into account the head_errlog feature. Before this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: Permanent errors have been detected in the following files: ============================= After this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: No known data errors ============================= Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14813 commit 3f2f9533ca8512ef515a73ac5661598a65b896b6 Author: George Amanakis <[email protected]> Date: Mon May 8 22:35:03 2023 +0200 Fixes in head_errlog feature with encryption For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of dsl_dataset_hold_obj() in order to enable access to the encryption keys (if loaded). This enables reporting of errors in encrypted filesystems which are not mounted but have their keys loaded. 
Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes #14837 commit 288ea63effae3ba24fcb6dc412a3125b9f3e1da9 Author: Matthew Ahrens <[email protected]> Date: Mon May 8 11:20:23 2023 -0700 Verify block pointers before writing them out If a block pointer is corrupted (but the block containing it checksums correctly, e.g. due to a bug that overwrites random memory), we can often detect it before the block is read, with the `zfs_blkptr_verify()` function, which is used in `arc_read()`, `zio_free()`, etc. However, such corruption is not typically recoverable. To recover from it we would need to detect the memory error before the block pointer is written to disk. This PR verifies BP's that are contained in indirect blocks and dnodes before they are written to disk, in `dbuf_write_ready()`. This way, we'll get a panic before the on-disk data is corrupted. This will help us to diagnose what's causing the corruption, as well as being much easier to recover from. To minimize performance impact, only checks that can be done without holding the spa_config_lock are performed. Additionally, when corruption is detected, the raw words of the block pointer are logged. (Note that `dprintf_bp()` is a no-op by default, but if enabled it is not safe to use with invalid block pointers.) Reviewed-by: Rich Ercolani <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Zuchowski <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #14817 commit 23132688b9d54ef11413925f88c02d83d607ec2b Author: Brian Behlendorf <[email protected]> Date: Mon May 8 11:17:41 2023 -0700 zdb: consistent xattr output When using zdb to output the value of an xattr only interpret it as printable characters if the entire byte array is printable. Additionally, if the --parseable option is set always output the buffer contents as octal for easy parsing. 
Reviewed-by: Olaf Faaland <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14830 commit 6deb342248e10af92e2d3fbb4e4b1221812188ff Author: Brian Behlendorf <[email protected]> Date: Mon May 8 10:09:30 2023 -0700 ZTS: add snapshot/snapshot_002_pos exception Add snapshot_002_pos to the known list of occasional failures for FreeBSD until it can be made entirely reliable. Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #14831 Closes #14832 commit a0a125bab291fe005d29be5375a5bb2a1c8261c7 Author: Alexander Motin <[email protected]> Date: Fri May 5 12:17:55 2023 -0400 Fix two abd_gang_add_gang() issues. - There is no reason to assert that added gang is not empty. It may be weird to add an empty gang, but it is legal. - When moving chain list from the added gang clear its size, or it will trigger assertion in abd_verify() when that gang is freed. Revie…

grahamperrin · 2023-07-09T18:35:05Z

zdb(8) example 2:

zfs/man/man8/zdb.8

Lines 530 to 534 in ca960ce

    
    .Ss Example 2 : No Display basic dataset information about Ar rpool
    .Bd -literal
    .No # Nm zdb Fl d Ar rpool
    Dataset mos [META], ID 0, cr_txg 4, 26.9M, 1051 objects
    Dataset rpool/swap [ZVOL], ID 59, cr_txg 356, 486M, 2 objects

Does that exemplify a ZFS volume for swap?

Blame: dd4769a#diff-c91195b2e58857a22090fb105807f3a46299eeba2a46d78d44dbff64d4e2ca7aR440-R442

chapmajs · 2024-08-11T16:41:27Z

Any update on this in 2024? I'm provisioning some servers and wanted to move to a ZFS swap device so I don't have to partition up the spinning disks.

RubenKelevra · 2024-08-11T17:49:03Z

@chapmajs

I'm using one zvol as a ZRAM writeback device and another zvol as secondary swap, after ZRAM. No issues so far. Just sharing my observations, no guarantee, though.

The zvol for swap has the following settings:

NAME       PROPERTY               VALUE                  SOURCE
pool/SWAP  volsize                64G                    local
pool/SWAP  volblocksize           128K                   -
pool/SWAP  checksum               on                     default
pool/SWAP  compression            zstd-fast              local
pool/SWAP  primarycache           metadata               local
pool/SWAP  secondarycache         none                   local
pool/SWAP  logbias                throughput             local
pool/SWAP  dedup                  off                    default
pool/SWAP  sync                   disabled               local
pool/SWAP  volmode                default                default
pool/SWAP  redundant_metadata     none                   local
pool/SWAP  encryption             off                    default
pool/SWAP  prefetch               all                    default

The ZRAM writeback device has the same settings, except that encryption is off, obviously.

ZRAM is sized at 25% of memory. The ZRAM writeback device and the swap zvol are large enough to never run into trouble, but I made them thin provisioned. So if you want to do this as well, keep in mind that the pool needs free space left, or the server will run into trouble.
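As a rough illustration of the thin-provisioned swap zvol described above, the sketch below only builds a `zfs create` command line and does not run it. The exact command used for this setup was not posted, so treat this as an assumption: `-s` (sparse) stands in for "thin provisioned", `-V` sets the volume size, and the `-o` properties are copied from the listing. `volblocksize` is included here because it can only be set when the volume is created. The helper name `zvol_create_argv` is made up for this example.

```python
def zvol_create_argv(name, size, props, sparse=True):
    """Build (but do not execute) a `zfs create` argv for a swap zvol.

    Hypothetical helper: the dataset name, size and properties are taken
    from the listing above, not from a command posted in the thread.
    """
    argv = ["zfs", "create"]
    if sparse:
        argv.append("-s")  # sparse volume, i.e. no refreservation
    argv += ["-V", size]
    for key, value in props.items():
        argv += ["-o", f"{key}={value}"]
    argv.append(name)
    return argv


SWAP_PROPS = {
    "volblocksize": "128K",
    "compression": "zstd-fast",
    "primarycache": "metadata",
    "secondarycache": "none",
    "logbias": "throughput",
    "sync": "disabled",
    "redundant_metadata": "none",
}

if __name__ == "__main__":
    print(" ".join(zvol_create_argv("pool/SWAP", "64G", SWAP_PROPS)))
```

Printing the command instead of running it makes it easy to review before touching a real pool.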

The only change I made to the ZFS settings is to pin down the maximum and minimum ARC size, as ZFS currently seems to suddenly drop the whole ARC down to a couple of hundred MB, and then has real trouble keeping up with IO afterward.

So I set the ARC size with the boot parameters zfs.zfs_arc_min=x zfs.zfs_arc_max=y, where x is 20% of memory and y is 60%.
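To make the 20% / 60% rule concrete, here is a small sketch that derives the two values from a `MemTotal` line as found in `/proc/meminfo` and formats them as boot parameters. `zfs_arc_min` and `zfs_arc_max` take byte values; the function names and the idea of parsing `/proc/meminfo` are assumptions added for illustration, not part of the setup described above.

```python
def parse_mem_total(meminfo_text):
    """Return MemTotal in bytes from /proc/meminfo-style text (values in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def arc_boot_params(mem_total_bytes, min_percent=20, max_percent=60):
    """Format ARC limits as kernel boot parameters (values in bytes)."""
    arc_min = mem_total_bytes * min_percent // 100
    arc_max = mem_total_bytes * max_percent // 100
    return f"zfs.zfs_arc_min={arc_min} zfs.zfs_arc_max={arc_max}"


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(arc_boot_params(parse_mem_total(f.read())))
```

Integer arithmetic is used so the result is an exact byte count rather than a rounded float.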

Regarding swapping, I run vm.swappiness = 150, vm.watermark_boost_factor = 0, vm.watermark_scale_factor = 10 and vm.page-cluster = 0.
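For anyone who wants to persist those four values, the sketch below renders them in the `key = value` form that sysctl configuration files use. The target path mentioned in the comment (`/etc/sysctl.d/99-swap.conf`) is a hypothetical choice, and the script only prints the fragment rather than writing it.

```python
# Values copied from the comment above.
SWAP_SYSCTLS = {
    "vm.swappiness": 150,
    "vm.watermark_boost_factor": 0,
    "vm.watermark_scale_factor": 10,
    "vm.page-cluster": 0,
}


def render_sysctl_conf(settings):
    """Render settings as sysctl.conf-style lines, one per key."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


if __name__ == "__main__":
    # Could be saved as e.g. /etc/sysctl.d/99-swap.conf (hypothetical file name).
    print(render_sysctl_conf(SWAP_SYSCTLS), end="")
```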

Hope this helps.

adamdmoss · 2024-08-11T19:44:36Z

@RubenKelevra what ZFS version? That doesn't match my last attempt at swap-on-zvol (early 2024 git?) so I'm excited that maybe something improved.

IvanVolosyuk · 2024-08-11T21:49:39Z

I think swap after zram can lead to priority inversion. Under harder memory pressure, more frequently used pages can end up on disk while less frequently used pages stay compressed in RAM. I think a zram backing device is the recommended way to spill data to disk. It can be configured to spill aggressively, so that enough zram stays free for new pages. I haven't tried using a zvol as the backing device, but it might be an interesting opportunity.
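The inversion can be illustrated with a toy placement model. This is a deliberate simplification and an assumption on my side: pages are evicted coldest first, zram has a fixed number of slots, and "writeback" is modelled as moving the oldest zram page to disk whenever zram is full. Real zram writeback is triggered explicitly and can select idle or incompressible pages, so the function below is not how the kernel works, only a sketch of where pages end up.

```python
from collections import deque


def place_pages(evicted, zram_slots, writeback):
    """Toy model: return (pages in zram, pages on disk).

    `evicted` lists pages in eviction order, coldest first.
    """
    zram = deque()
    disk = []
    for page in evicted:
        if len(zram) < zram_slots:
            zram.append(page)
        elif writeback:
            # Make room by spilling the oldest (coldest) zram page to disk.
            disk.append(zram.popleft())
            zram.append(page)
        else:
            # zram is full: the newer, hotter page goes straight to disk swap.
            disk.append(page)
    return list(zram), disk


if __name__ == "__main__":
    order = ["cold1", "cold2", "warm1", "warm2"]
    print(place_pages(order, zram_slots=2, writeback=False))
    print(place_pages(order, zram_slots=2, writeback=True))
```

With disk swap stacked after zram, the warm pages land on disk; with the writeback model, the cold pages do.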

RubenKelevra · 2024-08-11T22:03:32Z

@IvanVolosyuk well, there are modes to play with on the backing device. But I chose to push only incompressible pages to the backing device, as this way I can avoid trying to compress them at the ZFS level.

Sure, the priority may invert, but does that really matter that much? Decompression of zstd is quite fast on modern CPUs, but depending on the cores used it could very well be slower than reading from fast NVMe storage.

So whether zram decompression or the NVMe delivers the data back to memory faster depends, I guess, on how random the reads from the NVMe are.

I certainly can feel swapping in and out sometimes, as it's a desktop system, but it's just a tad sluggish for half a second or so.

I often see 6-8 GB swapped out on the ZFS. So it seems to work fine.

aircable · 2024-10-21T23:31:50Z

Great to see that it is still an open issue. It took me a long time to figure out why Ubuntu 24.04 was crashing. Not crashing exactly, just hanging without any output. It always happened when graphics were in use, and I got "freeing memory page" messages from the graphics controller in dmesg. This is an i5 with Intel graphics.
Since then I have replaced the ZFS swap with a native swap partition and it is now stable, other than the occasional "out of memory" kill of a process.
Since I've only just run into this issue, and you guys seem to have some idea how to work around it and configure ZFS swap correctly, would you share a summary of the status here, please? Maybe some recommendations?

RubenKelevra · 2024-10-22T07:06:36Z

I'm not seeing any crashes here with swap on ZFS, even when gaming (Witcher 3 on my Ultrabook).

I would call that one fixed. 🤷‍♂️

Can share my setup later when I get home.

Harvie · 2024-10-22T07:39:15Z

Does playing Witcher 3 on your ultrabook push memory consumption to its limits? I only had this happen on super over-crowded hosting servers during rush hour...

RubenKelevra · 2024-10-22T09:06:43Z

@Harvie definitely. It only has an Intel GPU built into the CPU, which uses the system memory, which is only 16 GB.

discoltk · 2024-10-22T12:43:31Z

I think it unwise to assume someone NOT having a problem means the problem isn't there. IIRC I was following this thread because of it happening to me on FreeBSD, so I believe it was a wider problem with openzfs. I switched to using a physical partition as swap and I'm not going to try to break it to verify, but I do not recall the bug ever being closed out. Don't doubt yourself if the symptoms match the bug. Change to a dedicated partition and see if it fixes it.


davidklaftenegger · 2024-10-22T12:53:28Z

Maybe some recommendations?

I am not aware of any fix for this, so my recommendation for swap on zfs remains "don't".

Harvie · 2024-10-22T12:56:08Z

I mean... This seems obvious... As long as ZFS might ever need to allocate memory from the kernel in order to write data to a ZVOL, it will never be safe to have swap on it, since allocating memory can lead to swapping, which in turn means you might need to swap in order to swap. Until this gets fixed, there is no way this can be reliable.

This could probably be fixed by providing an option to (statically?) pre-allocate enough memory to cover writing to the ZVOL in such cases.
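The circular dependency and the pre-allocation idea can be sketched with a toy model. Everything here is an assumption for illustration: `write_cost` stands for the pages the storage stack must allocate before a swap write can complete, and `reserve` stands for a hypothetical pre-allocated pool that only the swap path may use. It does not model ZFS or the kernel allocator.

```python
def swap_out_progress(free_pages, pages_to_evict, write_cost, reserve=0):
    """Toy model: count how many pages get swapped out before reclaim stalls.

    Each swap-out temporarily needs `write_cost` pages for the write path
    and frees one page once the write completes.
    """
    evicted = 0
    while evicted < pages_to_evict:
        if free_pages + reserve < write_cost:
            # The write path cannot allocate, and freeing memory would
            # itself require a swap write: no forward progress.
            break
        free_pages += 1  # buffers are returned, plus one page reclaimed
        evicted += 1
    return evicted


if __name__ == "__main__":
    print(swap_out_progress(free_pages=0, pages_to_evict=8, write_cost=4))
    print(swap_out_progress(free_pages=0, pages_to_evict=8, write_cost=4, reserve=4))
```

In this model a write path that needs no new allocations (`write_cost=0`) always makes progress, which corresponds to the requirement that swap writeback must not depend on fresh memory.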

DemiMarie · 2024-10-22T19:56:18Z

@Harvie I agree. I believe Linux forbids block-layer writeback from needing to allocate memory to make forward progress for this reason.

aircable · 2024-10-22T20:37:45Z

For me, it is totally clear. Since I removed swap on ZFS the machine has not crashed. And I was experiencing a crash a day or more when I really worked the machine. It was easy for me since I have a partition on the disk. I deleted pool/swap and added the partition as raw swap. This is 6.8.0-47-generic on an i5-11400. zfs-2.2.2-0ubuntu9, zfs-kmod-2.2.2-0ubuntu9.

I was hoping there is a special configuration you guys know, such as compression etc. for a zfs swap that would make it work.

And BTW, since I only have ZFS for root, I did not manage to add a swapfile. Do you have any recommendations on how to achieve that?

DurvalMenezes · 2024-10-22T20:51:44Z

My $0.02: swap on ZFS has never worked, and I avoid it like the plague (I always create a separate partition for it). I recommend the same to everyone.


amotin · 2024-10-22T21:00:33Z

I agree that swapping on ZFS is not a good idea, since it requires many new memory allocations. Still, I hope ZFS 2.3, with some improvements in it, will react better to the kernel's memory pressure requests and help from the other side, at least while ZFS still has some memory to free.

blessendor · 2024-10-28T11:37:51Z

Swap on ZFS would be a great feature for live migration of virtual machines between hypervisor hosts. For example, with Proxmox VE we can run a VM on ZFS storage which is replicated every minute via ZFS snapshots to a second node in a cluster. But if we can't use swap on ZFS, we can't replicate a swap volume as well.

RubenKelevra · 2024-10-28T12:14:10Z

Well, there was never an issue running VMs with swap on ZFS, because a VM can't make the host run out of memory for its ZFS implementation, at least not if you leave the host some breathing room.

The reason swap on ZFS is an issue is that when the system needs memory it will start swapping, but this prompts ZFS to use more memory because of the additional IO. So instead of freeing memory, this operation increases the memory pressure.

I think in my use case this isn't an issue, as the tuning for ZRAM and the use of ZRAM as the primary swap let the system swap out with less memory pressure, so it does not lead to lockups.
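To make the feedback loop above concrete, here is a toy model in Python. It is only a sketch: the function name, the watermark idea as modelled here, and every number are illustrative assumptions, not measurements of ZFS or of the kernel's reclaim logic. Each step the workload takes some pages; when free memory drops below a watermark, pages are swapped out, but each page written first needs `alloc_per_swapped_page` pages of memory from the swap backend.

```python
def simulate(free_pages, watermark, demand_per_step,
             alloc_per_swapped_page, steps):
    """Toy model of swap-out under memory pressure (hypothetical numbers).

    Returns the step at which swap-out can no longer make progress,
    or None if the system keeps running for all `steps`.
    """
    for step in range(steps):
        if demand_per_step > free_pages:
            return step  # out of memory before reclaim could even start
        free_pages -= demand_per_step  # workload takes memory

        deficit = watermark - free_pages
        if deficit > 0:
            # Writing `deficit` pages to swap first needs memory for the I/O.
            needed = deficit * alloc_per_swapped_page
            if needed > free_pages:
                return step  # the write itself cannot get memory: stall
            # Writes complete, temporary allocations are returned,
            # and the evicted pages become free.
            free_pages += deficit
    return None


# A backend that needs 2 pages per page written stalls once memory is tight.
print(simulate(100, 20, 15, 2, 50))
# A backend that needs no extra memory keeps making progress.
print(simulate(100, 20, 15, 0, 50))
```

In this model a backend with a small enough per-page cost (for example 0.25) also survives, which is the intuition behind putting a cheaper swap tier in front of the expensive one.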

makhomed · 2024-10-28T12:26:56Z

Swap on ZFS would be a great feature for live migration of virtual machines between hypervisor hosts. For example, with Proxmox VE we can run a VM on ZFS storage that is replicated every minute via ZFS snapshots to a second node in the cluster. But if we can't use swap on ZFS, we can't replicate the swap volume as well.

Just don't use ZFS inside virtual machines.

Create zvols on the hypervisor (bare metal server) and provide them to the virtual machines as virtual disks.

The only place you can't use ZFS for swap is the bare metal server, where the zfs module is loaded in the kernel.

Inside virtual machines you don't need the ZFS kernel module at all; a bare metal server zvol is just an ordinary virtual disk for the virtual machine.

I have used this setup for many years on many servers without any problems.

The bare metal server has a separate 256 GiB swap partition on NVMe, has the zfs kernel module and the KVM kernel module loaded, and has many zvols for virtual machines.

The virtual machines do not have the zfs kernel module loaded and know nothing about ZFS. All zvols from the bare metal server look like ordinary block devices to the virtual machines, and everything works fine, without any deadlocks.

It is only on the bare metal server that you can't create swap on ZFS.

Virtual machines know nothing about ZFS, and a bare metal server zvol is just a virtual block device to them, so the virtual machine kernel can use this virtual block device for swap without any problems.

On the bare metal server I create a separate 256 GiB zvol for each virtual machine to use as its swap partition.

The bare metal server has a 256 GiB swap partition on NVMe; a ZFS zvol can't be used here.

Each virtual machine has a 256 GiB swap partition on a separate virtual disk, which is a zvol on the bare metal server.

The virtual machine uses the virtual block device as its swap partition, and it works very well, without any problems.

The virtual machine has no zfs kernel module loaded and knows nothing about ZFS.

From the virtual machine's point of view it is not using a ZFS zvol for swap, so it has none of the problems of swap on a zvol; for the virtual machine it is just a virtual block device.
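One way to check the rule above on a given host is to look at its active swap devices and flag any that are zvols. The following Python sketch parses text in the format of `/proc/swaps`; the helper name is hypothetical, and it assumes zvols appear under their usual Linux device names (`/dev/zdN` or a `/dev/zvol/...` path). The sample input is made up for illustration.

```python
import re

# Usual Linux names for zvol block devices (assumption: default udev naming).
ZVOL_PATTERN = re.compile(r"^/dev/(zd\d+(p\d+)?|zvol/.+)$")


def zvol_swap_devices(proc_swaps_text):
    """Return the swap devices in /proc/swaps-style text that look like zvols."""
    devices = []
    for line in proc_swaps_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if fields and ZVOL_PATTERN.match(fields[0]):
            devices.append(fields[0])
    return devices


# Made-up sample in the /proc/swaps layout: one NVMe partition, one zvol.
sample = (
    "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"
    "/dev/nvme0n1p3                          partition\t268435452\t0\t\t-2\n"
    "/dev/zd0                                partition\t8388604\t\t0\t\t-3\n"
)
print(zvol_swap_devices(sample))
```

On a real host the same function could be fed the contents of `/proc/swaps`. On the bare metal server an empty result is what you want; inside a guest the zvol shows up as an ordinary virtual disk, so nothing would be flagged there either.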

Harvie · 2024-10-29T08:53:45Z

I love using ZFS with Proxmox LXC containers, unfortunately this cannot be applied there...

gaia · 2024-10-29T08:57:51Z

I love using ZFS with Proxmox LXC containers, unfortunately this cannot be applied there...

Proxmox puts the container swap inside the host's dedicated swap space, not within the container's disks. Because of this, it is safe to use ZFS for Proxmox LXC containers.

Or did I misunderstand what you said?

Harvie · 2024-10-29T09:08:43Z

Proxmox puts the container swap inside the host's dedicated swap space

But sometimes it feels like it would make sense to put Proxmox on ZFS, because that's how I manage most of my available disk space (to avoid LVM)...

behlendorf added the Component: Memory Management label on Jul 25, 2018

runderwo mentioned this issue Sep 27, 2018

Three-way deadlock between z_iput, sync(), and fsync() #7964

Open

mcr-ksh mentioned this issue Jun 17, 2019

zfs kernel panic & pool lockup - VERIFY3(range_tree_space(rt) + size <= sm->sm_size) failed #8918

Closed

Kramerican mentioned this issue Sep 8, 2019

Clarification regarding swap: Possible to define swap volume? canonical/lxd#6168

Closed

devZer0 mentioned this issue Oct 22, 2019

swap on zvol causes high system latency when memory pressure occurs #9435

Closed

ThinkChaos mentioned this issue Dec 7, 2023

Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 to 6.10.14 #15140

Closed
