Kernel panic on ZFS module unload with NULL pointer dereference in spl_kmem_cache_free #18174

@arturpzol

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | N/A |
| Kernel Version | 5.15.189 |
| Architecture | x86_64 |
| OpenZFS Version | 2.3.5 |

Describe the problem you're observing

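A kernel panic occurs during ZFS module unload (rmmod zfs) with a NULL pointer dereference in spl_kmem_cache_free(). While rmmod is destroying zfs_znode_cache, the kernel first reports objects remaining in the cache; roughly 8 ms later an RCU callback running in ksoftirqd calls spl_kmem_cache_free() with a NULL cache pointer (RDI = 0, fault address 0x28).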

Describe how to reproduce the problem

No fully reliable reproduction scenario has been identified. The issue is timing-dependent and occurs most often under the following conditions:

  • Large pool with many files/inodes
  • Significant filesystem activity before unmount
  • Executing rmmod zfs shortly after zfs umount
  • System shutdown/reboot scenarios

Userspace workarounds (sync, drop_caches, sleep) do not reliably prevent the crash.

Include any warning/errors/backtraces from the system logs

[ 8070.222528] =============================================================================
[ 8070.222538] BUG zfs_znode_cache (Tainted: P           O     ): Objects remaining in zfs_znode_cache on __kmem_cache_shutdown()
[ 8070.222553] -----------------------------------------------------------------------------
               
[ 8070.222554] Slab 0x00000000d74cad89 objects=29 used=1 fp=0x000000008340d8f5 flags=0x17fff0000010200(slab|head|node=0|zone=2|lastcpupid=0x7fff)
[ 8070.222561] CPU: 15 PID: 46268 Comm: rmmod Kdump: loaded Tainted: P           O      5.15.189 #4 99eff10b7a15abd00a5f147b42deac89a210e1eb
[ 8070.222564] Hardware name: Intel Corporation S2600IP/S2600IP, BIOS SE5C600.86B.02.06.0007.082420181029 08/24/2018
[ 8070.222566] Call Trace:
[ 8070.222570]  <TASK>
[ 8070.222572]  dump_stack_lvl+0x45/0x5b
[ 8070.222578]  slab_err+0x94/0xcb
[ 8070.222582]  ? cpumask_next+0x1e/0x30
[ 8070.222586]  __kmem_cache_shutdown.cold+0x50/0x1bf
[ 8070.222590]  kmem_cache_destroy+0x45/0xe0
[ 8070.222596]  spl_kmem_cache_destroy+0x15a/0x1d0 [spl 13f8907e7fb151d3ffcbbcbd8bb82336a42a609c]
[ 8070.222605]  ? synchronize_rcu+0x62/0x80
[ 8070.222611]  ? invoke_rcu_core+0xa0/0xa0
[ 8070.222613]  zfs_znode_fini+0x16/0x50 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.222744]  zfs_fini+0x2e/0x40 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.222847]  zfs_kmod_fini+0x67/0xc0 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.222952]  openzfs_fini+0xa/0x5ee [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223058]  __do_sys_delete_module+0x1a5/0x250
[ 8070.223062]  ? exit_to_user_mode_prepare+0x30/0x180
[ 8070.223064]  do_syscall_64+0x3c/0x90
[ 8070.223069]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[ 8070.223074] RIP: 0033:0x7f7503091b17
[ 8070.223076] Code: 73 01 c3 48 8b 0d 71 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 c3 2b 00 f7 d8 64 89 01 48
[ 8070.223078] RSP: 002b:00007fff2ebf0978 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 8070.223080] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7503091b17
[ 8070.223082] RDX: 00007f75030f7c60 RSI: 0000000000000800 RDI: 00005647f46d33e0
[ 8070.223083] RBP: 00005647f46d3380 R08: 00007f750334ef20 R09: 00007fff2ebef8f1
[ 8070.223084] R10: 00007fff2ebf0740 R11: 0000000000000206 R12: 00007fff2ebf0ba0
[ 8070.223085] R13: 00007fff2ebf0e5d R14: 0000000000000000 R15: 00005647f46d3380
[ 8070.223087]  </TASK>
[ 8070.223090] Object 0x0000000019505859 @offset=0
[ 8070.223092] =============================================================================
[ 8070.223092] BUG zfs_znode_cache (Tainted: P    B      O     ): Objects remaining in zfs_znode_cache on __kmem_cache_shutdown()
[ 8070.223093] -----------------------------------------------------------------------------
               
[ 8070.223094] Slab 0x00000000b186fa5b objects=29 used=1 fp=0x000000004e12d82f flags=0x17fff0000010200(slab|head|node=0|zone=2|lastcpupid=0x7fff)
[ 8070.223096] CPU: 15 PID: 46268 Comm: rmmod Kdump: loaded Tainted: P    B      O      5.15.189 #4 99eff10b7a15abd00a5f147b42deac89a210e1eb
[ 8070.223098] Hardware name: Intel Corporation S2600IP/S2600IP, BIOS SE5C600.86B.02.06.0007.082420181029 08/24/2018
[ 8070.223099] Call Trace:
[ 8070.223100]  <TASK>
[ 8070.223100]  dump_stack_lvl+0x45/0x5b
[ 8070.223102]  slab_err+0x94/0xcb
[ 8070.223104]  ? _printk+0x58/0x73
[ 8070.223108]  ? cpumask_next+0x1e/0x30
[ 8070.223110]  __kmem_cache_shutdown.cold+0x50/0x1bf
[ 8070.223112]  kmem_cache_destroy+0x45/0xe0
[ 8070.223114]  spl_kmem_cache_destroy+0x15a/0x1d0 [spl 13f8907e7fb151d3ffcbbcbd8bb82336a42a609c]
[ 8070.223121]  ? synchronize_rcu+0x62/0x80
[ 8070.223123]  ? invoke_rcu_core+0xa0/0xa0
[ 8070.223125]  zfs_znode_fini+0x16/0x50 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223227]  zfs_fini+0x2e/0x40 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223328]  zfs_kmod_fini+0x67/0xc0 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223433]  openzfs_fini+0xa/0x5ee [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223538]  __do_sys_delete_module+0x1a5/0x250
[ 8070.223539]  ? exit_to_user_mode_prepare+0x30/0x180
[ 8070.223541]  do_syscall_64+0x3c/0x90
[ 8070.223543]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[ 8070.223545] RIP: 0033:0x7f7503091b17
[ 8070.223547] Code: 73 01 c3 48 8b 0d 71 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 c3 2b 00 f7 d8 64 89 01 48
[ 8070.223548] RSP: 002b:00007fff2ebf0978 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 8070.223550] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7503091b17
[ 8070.223551] RDX: 00007f75030f7c60 RSI: 0000000000000800 RDI: 00005647f46d33e0
[ 8070.223552] RBP: 00005647f46d3380 R08: 00007f750334ef20 R09: 00007fff2ebef8f1
[ 8070.223553] R10: 00007fff2ebf0740 R11: 0000000000000206 R12: 00007fff2ebf0ba0
[ 8070.223555] R13: 00007fff2ebf0e5d R14: 0000000000000000 R15: 00005647f46d3380
[ 8070.223556]  </TASK>
[ 8070.223559] Object 0x000000001f054720 @offset=0
[ 8070.223560] kmem_cache_destroy zfs_znode_cache: Slab cache still has objects
[ 8070.223561] CPU: 15 PID: 46268 Comm: rmmod Kdump: loaded Tainted: P    B      O      5.15.189 #4 99eff10b7a15abd00a5f147b42deac89a210e1eb
[ 8070.223564] Hardware name: Intel Corporation S2600IP/S2600IP, BIOS SE5C600.86B.02.06.0007.082420181029 08/24/2018
[ 8070.223565] Call Trace:
[ 8070.223565]  <TASK>
[ 8070.223566]  dump_stack_lvl+0x45/0x5b
[ 8070.223568]  kmem_cache_destroy.cold+0x1c/0x21
[ 8070.223570]  spl_kmem_cache_destroy+0x15a/0x1d0 [spl 13f8907e7fb151d3ffcbbcbd8bb82336a42a609c]
[ 8070.223576]  ? synchronize_rcu+0x62/0x80
[ 8070.223578]  ? invoke_rcu_core+0xa0/0xa0
[ 8070.223580]  zfs_znode_fini+0x16/0x50 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223682]  zfs_fini+0x2e/0x40 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223784]  zfs_kmod_fini+0x67/0xc0 [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.223889]  openzfs_fini+0xa/0x5ee [zfs d67f7c9af4a662e93db1055a0ba77b4262911192]
[ 8070.224014]  __do_sys_delete_module+0x1a5/0x250
[ 8070.224016]  ? exit_to_user_mode_prepare+0x30/0x180
[ 8070.224017]  do_syscall_64+0x3c/0x90
[ 8070.224019]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[ 8070.224022] RIP: 0033:0x7f7503091b17
[ 8070.224024] Code: 73 01 c3 48 8b 0d 71 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 c3 2b 00 f7 d8 64 89 01 48
[ 8070.224025] RSP: 002b:00007fff2ebf0978 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 8070.224027] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7503091b17
[ 8070.224028] RDX: 00007f75030f7c60 RSI: 0000000000000800 RDI: 00005647f46d33e0
[ 8070.224029] RBP: 00005647f46d3380 R08: 00007f750334ef20 R09: 00007fff2ebef8f1
[ 8070.224030] R10: 00007fff2ebf0740 R11: 0000000000000206 R12: 00007fff2ebf0ba0
[ 8070.224031] R13: 00007fff2ebf0e5d R14: 0000000000000000 R15: 00005647f46d3380
[ 8070.224032]  </TASK>
[ 8070.232300] BUG: kernel NULL pointer dereference, address: 0000000000000028
[ 8070.232303] #PF: supervisor read access in kernel mode
[ 8070.232305] #PF: error_code(0x0000) - not-present page
[ 8070.232307] PGD 0 P4D 0 
[ 8070.232310] Oops: 0000 [#1] SMP PTI
[ 8070.232312] CPU: 14 PID: 85 Comm: ksoftirqd/14 Kdump: loaded Tainted: P    B      O      5.15.189 #4 99eff10b7a15abd00a5f147b42deac89a210e1eb
[ 8070.232315] Hardware name: Intel Corporation S2600IP/S2600IP, BIOS SE5C600.86B.02.06.0007.082420181029 08/24/2018
[ 8070.232316] RIP: 0010:spl_kmem_cache_free+0xe/0x1e0 [spl]
[ 8070.232325] Code: 8b 43 70 48 8d 58 90 48 3d 90 3c 98 a0 75 e2 48 c7 c7 60 3c 98 a0 5b e9 10 f7 77 e0 0f 1f 44 00 00 41 55 41 54 55 48 89 f5 53 <48> 8b 47 28 48 89 fb 48 85 c0 74 0c 48 8b 77 30 48 89 ef e8 ea 3e
[ 8070.232326] RSP: 0018:ffff888c4389fe08 EFLAGS: 00010286
[ 8070.232329] RAX: ffffffffa0d062f0 RBX: 0000000000000003 RCX: ffffffff82942740
[ 8070.232331] RDX: ffff888cdfa71488 RSI: ffff888cdfa70d08 RDI: 0000000000000000
[ 8070.232332] RBP: ffff888cdfa70d08 R08: 0000000000000282 R09: 0000000000000000
[ 8070.232333] R10: 0000000000000015 R11: 0000000000000029 R12: 0000000000000002
[ 8070.232334] R13: 000000000000000a R14: ffff88941efabbb0 R15: 0000000000000000
[ 8070.232336] FS:  0000000000000000(0000) GS:ffff88941ef80000(0000) knlGS:0000000000000000
[ 8070.232337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8070.232339] CR2: 0000000000000028 CR3: 0000000002c0a004 CR4: 00000000000606e0
[ 8070.232340] Call Trace:
[ 8070.232361]  <TASK>
[ 8070.232362]  rcu_core+0x207/0x650
[ 8070.232365]  handle_softirqs+0xe7/0x270
[ 8070.232371]  ? smpboot_register_percpu_thread+0xd0/0xd0
[ 8070.232376]  run_ksoftirqd+0x2f/0x40
[ 8070.232379]  smpboot_thread_fn+0xaf/0x140
[ 8070.232381]  kthread+0x118/0x140
[ 8070.232386]  ? set_kthread_struct+0x50/0x50
[ 8070.232388]  ret_from_fork+0x22/0x30
[ 8070.232394]  </TASK>
[ 8070.232396] Modules linked in: 8021q zfs(PO-) qat_api(O) spl(O) intel_qat(O) uio iptable_filter target_core_iblock target_core_pscsi iscsi_target_mod target_core_mod nvmet_rdma nvmet_tcp nvmet nvme_rdma nvme_tcp nvme_fabrics bonding ib_iser rdma_cm iw_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse mlx5_ib ib_umad ib_ipoib ib_cm mlx4_ib ib_uverbs ib_core mlx4_en mlx4_core x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel aesni_intel crypto_simd cryptd rapl intel_cstate mlx5_core mlxfw pci_hyperv_intf igb(O) ptp pps_core be2net button nls_iso8859_1 nls_cp437 ses sg ipmi_si ipmi_devintf ipmi_msghandler mpt3sas(O) nvme raid_class scsi_transport_sas nvme_core megaraid_sas(O) vfat fat aufs scsi_transport_fc [last unloaded: scst]
[ 8070.232456] CR2: 0000000000000028
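
For reference, the faulting bytes `<48> 8b 47 28` in the Code line decode to `mov 0x28(%rdi),%rax`, followed by `48 8b 77 30`, i.e. `mov 0x30(%rdi),%rsi`. With RDI = 0 this is a read from address 0x28, which matches CR2. The sketch below only illustrates that arithmetic; `struct toy_skc` and `toy_fault_address()` are hypothetical names with an assumed LP64 layout, not the real spl_kmem_cache_t definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the real spl_kmem_cache_t. */
struct toy_skc {
    uint64_t pad[5];                      /* 0x00..0x27: placeholder fields  */
    void (*dtor)(void *obj, void *priv);  /* 0x28: first field that is read  */
    void *priv;                           /* 0x30: read next, as an argument */
};

/*
 * Address the CPU would touch when reading base->dtor, computed
 * arithmetically so that nothing is actually dereferenced.
 */
static uintptr_t toy_fault_address(uintptr_t base)
{
    return base + offsetof(struct toy_skc, dtor);
}
```

With a NULL base, `toy_fault_address(0)` evaluates to 0x28 under this assumed layout, the same value the oops reports in CR2.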

Since SPL is GPL-licensed, it could technically call rcu_barrier() before destroying the Linux slab cache in spl_kmem_cache_destroy(). Would this be considered an acceptable approach, or would it be preferable to skip cache destruction when objects remain?
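
To make the first option concrete, here is a minimal userspace model of the ordering problem, not kernel code. It assumes that znode frees are deferred to callbacks which read a global cache pointer, and that unload clears that pointer after destroying the cache; the trace above (rcu_core calling spl_kmem_cache_free() with RDI = 0) is consistent with that but does not prove it. All `toy_*` names are hypothetical, and `toy_run_callbacks()` merely stands in for what rcu_barrier() would guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for illustration; none of these are kernel or SPL APIs. */
typedef struct toy_cache {
    int live_objects;               /* allocated and not yet freed */
} toy_cache_t;

typedef struct toy_deferred {
    struct toy_deferred *next;      /* singly linked queue of pending frees */
} toy_deferred_t;

static toy_cache_t *toy_znode_cache;   /* global read at callback time */
static toy_deferred_t *toy_pending;    /* callbacks awaiting a grace period */
static int toy_null_derefs;            /* callbacks that saw a NULL cache */

/* Queue a deferred free, loosely modelling call_rcu(). */
static void toy_defer_free(void)
{
    toy_deferred_t *d = malloc(sizeof (*d));
    assert(d != NULL);
    d->next = toy_pending;
    toy_pending = d;
}

/* Run every queued callback; models the softirq draining the queue. */
static void toy_run_callbacks(void)
{
    while (toy_pending != NULL) {
        toy_deferred_t *d = toy_pending;
        toy_pending = d->next;
        if (toy_znode_cache == NULL)
            toy_null_derefs++;      /* the real kernel would oops here */
        else
            toy_znode_cache->live_objects--;
        free(d);
    }
}

/*
 * Simulate module unload with three frees still queued. Returns how many
 * callbacks ran against a NULL cache; *remaining receives the number of
 * objects still in the cache at the moment it is destroyed.
 */
static int toy_unload(int use_barrier, int *remaining)
{
    toy_cache_t cache = { .live_objects = 0 };

    toy_znode_cache = &cache;
    toy_pending = NULL;
    toy_null_derefs = 0;

    for (int i = 0; i < 3; i++) {
        cache.live_objects++;
        toy_defer_free();
    }

    if (use_barrier)
        toy_run_callbacks();        /* wait for pending callbacks first */

    *remaining = cache.live_objects; /* "Objects remaining" at destroy */
    toy_znode_cache = NULL;          /* cache destroyed, pointer cleared */

    toy_run_callbacks();             /* late callbacks, if any */
    return (toy_null_derefs);
}
```

In this model, `toy_unload(0, &n)` leaves three objects in the cache at destroy time and then runs three callbacks against a NULL cache, while `toy_unload(1, &n)` drains the queue first and hits neither problem. Skipping the destroy when objects remain would avoid the slab warning but, in this model, would not stop the late callbacks from seeing a NULL pointer.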

Labels: Type: Defect (incorrect behavior, e.g. crash, hang)
