Partial write happens after VM crash with recordsize=16k #17879

@healwon

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 25.04 |
| Kernel Version | 6.14.0-32-generic |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.3.1-1ubuntu2 / zfs-kmod-2.3.1-1ubuntu2 |

Describe the problem you're observing

During crash-consistency testing, I observed a torn write that appears to occur only when recordsize=16k (with all other parameters left at their defaults).

After a VM crash during a multi-threaded write workload with 16 KB granularity, the recovered file contains data that ends partway through a 16 KB chunk, indicating a partial (torn) write at that offset.

Describe how to reproduce the problem

I tested the crash scenario in a QEMU VM.

1. Create zfs pool

zpool create -f zfspool /dev/sdc
zfs set recordsize=16K zfspool
zfs set mountpoint=/mnt/test zfspool

2. Run workload

For testing, I wrote a simple microbenchmark that spawns multiple threads (16 to 256). Each thread repeatedly overwrites its own zero-filled file in 16 KB chunks.
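The benchmark itself is not attached, so the following is only a minimal Python sketch of a workload with the same shape. Everything in it is an assumption for illustration: the `pattern` payload, the one-`fsync`-per-chunk policy (guessed from the `mtfsync.out.N` file names), and the 8 MiB file size (taken from the `0800000` end offset in the hexdump below).

```python
import os
import threading

CHUNK = 16 * 1024             # 16 KB write granularity, as in the report
FILE_SIZE = 8 * 1024 * 1024   # assumption: 8 MiB, from the hexdump end offset


def pattern(thread_id: int, chunk_index: int) -> bytes:
    """Deterministic 16 KB payload with no zero bytes (hypothetical pattern)."""
    seed = (thread_id * 131 + chunk_index * 17) % 255 + 1  # always 1..255
    return bytes([seed]) * CHUNK


def worker(path: str, thread_id: int, file_size: int, passes: int) -> None:
    # Create a file that reads back as zeros (sparse here, not pre-written).
    with open(path, "wb") as f:
        f.truncate(file_size)
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(passes):
            for i in range(file_size // CHUNK):
                os.pwrite(fd, pattern(thread_id, i), i * CHUNK)
                os.fsync(fd)  # assumption: the original syncs after each chunk
    finally:
        os.close(fd)


def run(directory: str, threads: int = 16,
        file_size: int = FILE_SIZE, passes: int = 1) -> list:
    """Start one worker per file and wait for all of them to finish."""
    paths = [os.path.join(directory, f"mtfsync.out.{t}") for t in range(threads)]
    pool = [threading.Thread(target=worker, args=(p, t, file_size, passes))
            for t, p in enumerate(paths)]
    for th in pool:
        th.start()
    for th in pool:
        th.join()
    return paths
```

Calling `run("/mnt/test", threads=16, passes=100)` inside the VM and killing QEMU while it runs would approximate steps 2 and 3. A payload without zero bytes makes it easy to see afterwards where written data stops.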

3. Terminate VM

4. Reboot VM & check the remounted filesystem

After reboot, remount the ZFS pool and inspect written data:

root@ubuntu:/mnt/test# hexdump mtfsync.out.14

...

00069f0 2d75 98e0 0850 73bb de2b 4e96 b906 2971
0006a00 94dc 044c 6fb7 da27 4a92 b502 256d 90d8
0006a10 fb48 6bb3 d623 468e b1f9 2169 8cd4 f744
0006a20 67af d21f 428a adf5 1d65 88d0 f340 63ab
0006a30 ce1b 3e86 a9f1 1961 84cc ef3c 5fa7 ca17
0006a40 3a82 a5ed 155d 80c8 eb38 5ba3 c613 367e
0006a50 a1e9 1159 7cc4 e734 0000 0000 0000 0000
0006a60 0000 0000 0000 0000 0000 0000 0000 0000
*
0800000

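Written data should end on a 16 KB (0x4000) boundary, but here the non-zero data stops at about offset 0x6a58 (the all-zero rows begin at 0x6a60), partway through the second 16 KB chunk. This means a torn write occurred and was not repaired during recovery.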
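The same check can be done without reading the hexdump by eye. The sketch below is a hypothetical helper, not part of the original test harness, and it assumes the written payload never ends in zero bytes (otherwise the end of data would be undercounted).

```python
from typing import Optional

CHUNK = 16 * 1024  # expected write granularity (0x4000)


def end_of_data(data: bytes) -> int:
    """Offset just past the last non-zero byte (0 if the file is all zeros)."""
    return len(data.rstrip(b"\x00"))


def torn_offset(data: bytes, chunk: int = CHUNK) -> Optional[int]:
    """Return the end-of-data offset if it is not chunk-aligned, else None."""
    end = end_of_data(data)
    return end if end % chunk else None


if __name__ == "__main__":
    # Synthetic stand-in for the recovered file: 0x6a58 bytes of data,
    # zero-padded to 8 MiB (0x800000), as the hexdump suggests.
    recovered = b"\xab" * 0x6A58 + b"\x00" * (0x800000 - 0x6A58)
    off = torn_offset(recovered)
    print(hex(off), off % CHUNK)  # 0x6a58 is 10840 bytes into the second chunk
```

With the offsets from the hexdump, 0x6a58 = 27224 = 16384 + 10840, so the file holds one complete 16 KB chunk followed by 10840 bytes of the next one.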

Include any warning/errors/backtraces from the system logs

After I examined the code, I found this comment:

	/*
	 * There must be enough space in the log block to hold reclen.
	 * For WR_COPIED, we need to fit the whole record in one block,
	 * and reclen is the write record header size + the data size.
	 * For WR_NEED_COPY, we can create multiple records, splitting
	 * the data into multiple blocks, so we only need to fit one
	 * word of data per block; in this case reclen is just the header
	 * size (no data).
	 */

A record smaller than 8k or larger than 32k is not split in zil_lwb_assign(), but a 16k record may be split at a log block boundary. If the preceding log block is committed before the crash and the following log block is not, replay may apply only the first part of the record, which would produce this torn write.
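To make the suspected failure mode concrete, here is a toy Python model of a record's data being split across log blocks. It is not the zil_lwb_assign() logic: `split_record`, `replayed_bytes`, the block capacity, and the free-space figure are all hypothetical, and record header overhead is ignored.

```python
def split_record(data_len: int, free_in_block: int, block_capacity: int) -> list:
    """Split data_len bytes across log blocks: the first piece fills the
    space left in the current block, later pieces use fresh blocks."""
    if free_in_block <= 0 or block_capacity <= 0:
        raise ValueError("block sizes must be positive")
    pieces = []
    remaining = data_len
    room = free_in_block
    while remaining > 0:
        take = min(remaining, room)
        pieces.append(take)
        remaining -= take
        room = block_capacity  # every following block starts empty
    return pieces


def replayed_bytes(pieces: list, committed_blocks: int) -> int:
    """Bytes recovered if only the first committed_blocks blocks hit disk."""
    return sum(pieces[:committed_blocks])


if __name__ == "__main__":
    # Hypothetical numbers: 10840 bytes free in the current log block,
    # picked only so the result lines up with the hexdump; not measured.
    pieces = split_record(16 * 1024, free_in_block=10840, block_capacity=128 * 1024)
    print(pieces)                     # two pieces: 10840 and 5544 bytes
    print(replayed_bytes(pieces, 1))  # only the first piece survives the crash
```

In this model a 16 KB write at file offset 16384 whose first piece is replayed alone would leave data ending at 16384 + 10840 = 27224 (0x6a58). Whether the real ZIL can commit and replay the pieces independently is exactly the open question of this report.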

Let me know if I'm missing any details or more detailed information is required. Thank you.

Labels

Type: Defect (Incorrect behavior, e.g. crash, hang)
