Why BTRFS doesn't offer live deduplication as ZFS? #304

tlaurion · 2025-03-14T17:20:04Z

Really high-level question, I know. I really appreciate this project, but I can't stop thinking: if the hash table were managed by the kernel, and async I/O (AIO) were used in the kernel to spool writes and check extents before writing, couldn't the initial write be deduped instead of being rewritten?

I guess this would add far too much delay to be implemented inside the kernel, as ZFS does? (Not tested, but I plan to do so on top of QubesOS, since they now ship dom0 requirements for dom0, while the installer doesn't permit ZFS partitioning and pool creation, so templates and qubes are created in the stage2 install as for BTRFS.)

Was interested in reading your thoughts on why BTRFS doesn't offer live deduplication and only offline deduplication, which bees permits.
Thanks!

Zygo · 2025-03-15T23:54:52Z

A brief history of Linux dedupe implementations

| Deduper | Filesystem | Year | Type | Search method | Block size | Memory size | Overflow to disk |
|---|---|---|---|---|---|---|---|
| built-in | ZFS | 2008 | in-band | page write | large | large | yes (optional since 2024) |
| Permabit (VDO) | blockdev | 2009 (VDO in 2017) | in-band | block write | small | fixed | no |
| duperemove | btrfs / xfs / bcachefs | 2013 (bcachefs in 2017) | out-of-band | file tree walk + FIEMAP filter + file read | large | large | yes |
| dedupe out-of-tree kernel patch | btrfs | 2013-2019, never completed | in-band | page write | large | small | no |
| bees | btrfs | 2014 | out-of-band | fs-specific tree search + block read | small | fixed (sampling) | no |
| `GETFSMAP` experiments ¹ | xfs | 2017-2021 | out-of-band | sequential device read | TBD | fixed (partitioning) | no |
| dduper ² | btrfs | 2020 | out-of-band | file tree walk + csum tree read | large | large | yes |

¹ Dave Chinner proposed a deduper based on `GETFSMAP` and raw block device reads, which would perform out-of-band dedupe during a scrub operation. Last time I checked, this effort was sidelined because the XFS reflink implementation ends up being slower than every viable alternative except btrfs.

² A prototype was developed, but never finished. There have been a few projects like this, because people keep getting the same idea.

The ZFS approach does avoid duplicate data writes, and it's almost unique among dedupe strategies for that. It also has far higher memory and metadata IO requirements than any other dedupe strategy, all else being equal. The design is a naive content-addressable data store, despite enormous resources available to Sun's developers at the time, and active contemporary research in alternative dedupe strategies like VDO and reflink. The conventional guidance for ZFS users is to never use dedupe, especially after reflink support became available in ZFS, despite recent ZFS dedupe improvements.
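
The content-addressable design described above can be illustrated with a toy model. This is a minimal sketch, not ZFS code: the `InBandStore` class, its fields, and the 4 KiB `BLOCK_SIZE` are all made up for illustration. It shows why the approach avoids duplicate writes (a hit only bumps a reference count) and why the table needs one entry per unique block and a lookup on every write.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative value, not a ZFS default


class InBandStore:
    """Toy content-addressable block store (hypothetical, for illustration)."""

    def __init__(self):
        self.table = {}     # digest -> [physical block index, refcount]
        self.physical = []  # unique blocks that were actually "written"
        self.logical = []   # logical block number -> digest

    def write(self, data: bytes) -> None:
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            entry = self.table.get(digest)  # lookup happens on the write path
            if entry is None:
                # First time this content is seen: store it, add a table entry.
                self.physical.append(block)
                self.table[digest] = [len(self.physical) - 1, 1]
            else:
                # Duplicate content: no data write, only a refcount update.
                entry[1] += 1
            self.logical.append(digest)


store = InBandStore()
store.write(b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE)
print(len(store.logical), len(store.physical), len(store.table))
```

Four logical blocks land in two physical blocks, but the table still needs an entry for each unique block, which is where the memory and metadata IO cost comes from when most data is not duplicated.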

> your thoughts on why BTRFS doesn't offer live deduplication and only offline deduplication, which bees permits.

My guess is that in-band dedupe in btrfs simply isn't worth the effort:

- ZFS-style dedupe has high memory and metadata IO costs that could be better spent on page cache, application data, or even extending SSD write lifespan (i.e. by not performing dedupe metadata operations on hybrid SSD/HDD setups).
- If a user has unlimited budget for RAM, they can run ZFS today. There is no need to invest in development of another filesystem if the existing ZFS dedupe is acceptable. If the existing ZFS dedupe isn't acceptable, ZFS may not be the best starting point for dedupe alternatives.
- In-band dedupe isn't necessarily a filesystem-specific feature--it could be implemented in the Linux VFS layer, using reflink and dedupe operations already supported by multiple filesystems. btrfs metadata performance isn't great compared to other filesystems with viable dedupe, like xfs or bcachefs. Those filesystems might be a better starting point than btrfs for anyone who wants to implement in-band Linux dedupe.
- (Another edit) We are close to the performance ceiling of in-band dedupers, but we are nowhere near the performance ceiling on out-of-band dedupers. Dave Chinner's idea combines a full-filesystem dedupe with a scrub operation that users are likely running already for data integrity purposes, making the marginal cost of dedupe negligible. csum-based dedupers can scan hundreds of GiB per second of logical data on a middle-tier desktop. I estimate we can make out-of-band dedupers between 100x and 2000x faster than bees is today.
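
For readers who haven't seen the generic dedupe operation mentioned in the list above, here is a minimal sketch of calling the Linux `FIDEDUPERANGE` ioctl from Python. The helper names `build_request` and `dedupe_range` are invented for this example; the struct layout and ioctl number follow `struct file_dedupe_range` in `<linux/fs.h>`. It only works on Linux, on a filesystem that supports the ioctl, and it is not how bees or any other tool mentioned here is implemented.

```python
import fcntl
import struct

# _IOWR(0x94, 54, struct file_dedupe_range) from <linux/fs.h>
FIDEDUPERANGE = 0xC0189436

# struct file_dedupe_range: src_offset, src_length, dest_count, reserved1, reserved2
HEADER = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd, dest_offset, bytes_deduped, status, reserved
INFO = struct.Struct("=qQQiI")


def build_request(src_offset, length, dst_fd, dst_offset):
    """Pack a request with exactly one destination range (hypothetical helper)."""
    return bytearray(
        HEADER.pack(src_offset, length, 1, 0, 0)
        + INFO.pack(dst_fd, dst_offset, 0, 0, 0)
    )


def dedupe_range(src_fd, src_offset, length, dst_fd, dst_offset):
    """Ask the kernel to share one extent range; returns (bytes_deduped, status)."""
    buf = build_request(src_offset, length, dst_fd, dst_offset)
    # A mutable buffer lets the kernel write per-destination results back.
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
    _, _, bytes_deduped, status, _ = INFO.unpack_from(buf, HEADER.size)
    # status: 0 = ranges were identical and are now shared,
    #         1 = contents differ, negative = -errno
    return bytes_deduped, status
```

The kernel compares the two ranges itself and only shares them if the bytes are identical, so a caller can be wrong about a match without corrupting data.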

(edit) Oh, I almost forgot about VDO, the Other Inline Deduper.

VDO's design, grossly oversimplified, holds a hash table in memory which is used to locate and remap duplicate 4K blocks. VDO's main problem is that blindly deduping at 4K block size leads to massive fragmentation due to nuisance dedupes, which ultimately kills performance. Naively deduping a block device at greater than 4K block size leads to almost zero hit rates, because dedupe can only occur when a filesystem extent aligns with dedupe blocks at fixed locations.
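
The alignment effect is easy to reproduce with a toy model. The sketch below is not VDO code: the `duplicate_blocks` function and the device layout are assumptions made up for illustration. It places two copies of the same 64 KiB payload on a simulated device, with the second copy starting 4 KiB past a 16 KiB boundary, then counts duplicate fixed-size blocks at two block sizes.

```python
import hashlib


def duplicate_blocks(device: bytes, block_size: int) -> int:
    """Count fixed-position blocks whose content was already seen earlier."""
    seen = set()
    hits = 0
    for off in range(0, len(device) - block_size + 1, block_size):
        digest = hashlib.sha256(device[off:off + block_size]).digest()
        if digest in seen:
            hits += 1
        else:
            seen.add(digest)
    return hits


# 64 KiB of deterministic, non-repeating data (2048 distinct 32-byte digests).
payload = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048))
filler = b"\xff" * 4096  # 4 KiB of unrelated data that shifts the second copy

# Copy 1 starts at offset 0; copy 2 starts at 68 KiB.
device = payload + filler + payload

print(duplicate_blocks(device, 4096))   # 4 KiB dedupe blocks
print(duplicate_blocks(device, 16384))  # 16 KiB dedupe blocks
```

At 4 KiB every block of the second copy matches (16 hits). At 16 KiB nothing matches, even though the device holds 64 KiB of duplicate data, because the second copy never lines up with the 16 KiB block grid.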

The main complaint I used to get about bees from VDO users is that bees (up to v0.10) and VDO were too similar--because they both could dedupe on 4K boundaries, they both fragmented the filesystem to exhaustion. bees v0.11 fixes that by adding heuristics to determine if bees should dedupe smaller extents. It looks like this could be fixed in VDO the same way, but for whatever reason, the maintainer hasn't chosen to do that yet (there are GitHub issues on kvdo about this, closed 6 years ago).
