Partial write happens after VM crash with `recordsize=16k`

### System information

Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 25.04
Kernel Version | 6.14.0-32-generic
Architecture | x86_64
OpenZFS Version | zfs-2.3.1-1ubuntu2 / zfs-kmod-2.3.1-1ubuntu2



### Describe the problem you're observing

During crash-consistency testing, I observed torn writes that appear to occur only when `recordsize=16k` (all other parameters are left at their defaults).

After a VM crash during a multi-threaded write workload with 16 KiB granularity, the recovered file contains a data segment that does not end on a 16 KiB boundary, which indicates partial (torn) writes at specific offsets.


### Describe how to reproduce the problem

I tested the crash scenario in a QEMU VM.

**1. Create zfs pool**

```
zpool create -f zfspool /dev/sdc
zfs set recordsize=16K zfspool
zfs set mountpoint=/mnt/test zfspool
```

**2. Run workload**

For testing, I wrote a simple microbenchmark that spawns multiple threads (between 16 and 256). Each thread repeatedly overwrites its own zero-filled file in 16 KiB chunks.
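The workload can be approximated with a short sketch. This is not the original microbenchmark. The language (Python), the `fsync` after every write (guessed from the `mtfsync.out.N` file names), the 8 MiB file size (taken from the final hexdump offset `0800000`), and all function and parameter names are assumptions.

```python
import os
import threading

CHUNK = 16 * 1024            # 16 KiB write granularity, matching recordsize=16K
FILE_SIZE = 8 * 1024 * 1024  # assumed: the hexdump ends at 0x800000


def worker(path, file_size=FILE_SIZE, chunk=CHUNK, passes=1):
    """Overwrite one file chunk by chunk, syncing after each write."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, file_size)          # a new file reads back as all zeros
        for _ in range(passes):
            for offset in range(0, file_size, chunk):
                data = os.urandom(chunk)     # random payload for this chunk
                os.pwrite(fd, data, offset)  # one chunk-aligned write
                os.fsync(fd)                 # assumed: the benchmark syncs each write
    finally:
        os.close(fd)


def run_workload(directory, threads=16, **kwargs):
    """Start one worker thread per file and wait for all of them."""
    paths = [os.path.join(directory, f"mtfsync.out.{i}") for i in range(threads)]
    pool = [threading.Thread(target=worker, args=(p,), kwargs=kwargs) for p in paths]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return paths
```

For a reproduction attempt, something like `run_workload("/mnt/test", threads=64, passes=100)` would be started inside the VM and the VM killed while it runs.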

**3. Terminate VM**

**4. Reboot VM & check the remounted filesystem**

After reboot, remount the ZFS pool and inspect written data:

```bash
root@ubuntu:/mnt/test# hexdump mtfsync.out.14

...

00069f0 2d75 98e0 0850 73bb de2b 4e96 b906 2971
0006a00 94dc 044c 6fb7 da27 4a92 b502 256d 90d8
0006a10 fb48 6bb3 d623 468e b1f9 2169 8cd4 f744
0006a20 67af d21f 428a adf5 1d65 88d0 f340 63ab
0006a30 ce1b 3e86 a9f1 1961 84cc ef3c 5fa7 ca17
0006a40 3a82 a5ed 155d 80c8 eb38 5ba3 c613 367e
0006a50 a1e9 1159 7cc4 e734 0000 0000 0000 0000
0006a60 0000 0000 0000 0000 0000 0000 0000 0000
*
0800000
```

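The data should end on a 16 KiB boundary, but in the dump above the non-zero data stops inside the line at 0x6a50 and everything from 0x6a60 to the end of the file is zero. Neither offset is a multiple of 16 KiB (0x4000), which means a torn write occurred and was not repaired during recovery.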
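The manual hexdump inspection can be automated with a small checker. This is a sketch, not part of the original test setup: `find_torn_records` and the `min_zero_run` threshold are hypothetical, and the check is only meaningful while each record of the file is either still zero or has been fully overwritten with non-zero-looking data (the first pass over a zero-filled file).

```python
RECORD = 16 * 1024  # 16 KiB, i.e. 0x4000


def find_torn_records(data, record=RECORD, min_zero_run=64):
    """Return (record_offset, data_end_offset) for partly written records.

    A record counts as torn when it starts with data but ends in a run of
    at least `min_zero_run` zero bytes. The threshold tolerates a payload
    that happens to end in a few zero bytes.
    """
    torn = []
    for start in range(0, len(data), record):
        block = data[start:start + record]
        written = len(block.rstrip(b"\x00"))  # bytes before the trailing zeros
        if written == 0:
            continue                          # record never written (all zero)
        if len(block) - written >= min_zero_run:
            torn.append((start, start + written))
    return torn


def check_file(path):
    """Read a recovered file and report its torn records."""
    with open(path, "rb") as f:
        return find_torn_records(f.read())
```

Applied to a file shaped like the dump above, the checker would flag the record starting at 0x4000, with data ending 0x2a58 (10840) bytes into that record.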


### Include any warning/errors/backtraces from the system logs


After I examined the code, I found [this comment](https://github.com/openzfs/zfs/blob/b2196fbedf5dbfb8593288f5f9ba712e31429a84/module/zfs/zil.c#L2336C1-L2344C5):

```
	/*
	 * There must be enough space in the log block to hold reclen.
	 * For WR_COPIED, we need to fit the whole record in one block,
	 * and reclen is the write record header size + the data size.
	 * For WR_NEED_COPY, we can create multiple records, splitting
	 * the data into multiple blocks, so we only need to fit one
	 * word of data per block; in this case reclen is just the header
	 * size (no data).
	 */
```

While a record smaller than 8k or larger than 32k is not split in `zil_lwb_assign()`, a 16k record may be split at a log block boundary. If the preceding log block is committed before the crash while the following log block is not, the crash may leave this torn write.
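To make the hypothesis concrete, here is a toy model of a record being split across log blocks. It is not the logic of `zil_lwb_assign()`: the function names, the block capacity, and the free space are all hypothetical, and the numbers are chosen only to reproduce the 0x2a58 (10840) bytes of the second record seen in the dump.

```python
def split_across_log_blocks(data_len, space_left, block_capacity):
    """Split one record's data into (offset, length) pieces, one per log block."""
    if block_capacity <= 0:
        raise ValueError("block_capacity must be positive")
    pieces = []
    offset = 0
    while offset < data_len:
        take = min(space_left, data_len - offset)
        if take > 0:
            pieces.append((offset, take))
        offset += take
        space_left = block_capacity  # the next piece starts in a fresh log block
    return pieces


def replayed_bytes(pieces, committed_blocks):
    """Bytes restored if only the first `committed_blocks` pieces were committed."""
    return sum(length for _, length in pieces[:committed_blocks])


# Hypothetical numbers: a 16 KiB write arrives when the current log block
# has 10840 bytes of space left for data.
pieces = split_across_log_blocks(16 * 1024, 10840, 32 * 1024)
partial = replayed_bytes(pieces, committed_blocks=1)
```

In this model, if only the first log block reaches disk before the crash, recovery restores `partial` bytes of the 16 KiB record and the rest stays zero, which matches the shape of the observed corruption.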

Let me know if I'm missing any details or more detailed information is required. Thank you.

Partial write happens after VM crash with `recordsize=16k` #17879

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Type	Version/Name
Distribution Name	Ubuntu
Distribution Version	25.04
Kernel Version	6.14.0-32-generic
Architecture	x86_64
OpenZFS Version	zfs-2.3.1-1ubuntu2 / zfs-kmod-2.3.1-1ubuntu2

Partial write happens after VM crash with recordsize=16k #17879

Description

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Partial write happens after VM crash with `recordsize=16k` #17879