Why BTRFS doesn't offer live deduplication as ZFS? #304

Open
tlaurion opened this issue Mar 14, 2025 · 1 comment

Comments

@tlaurion

Really high-level question, I know. I really appreciate this project, but I can't stop thinking: if the hash table were managed by the kernel, and async I/O (AIO) were used in the kernel to spool writes and check extents prior to the write, couldn't the initial write be deduped instead of being rewritten?

I guess this would add way too much delay to be implemented inside the kernel the way ZFS does it? (Not tested, but I plan to do so on top of QubesOS, since they now ship the ZFS requirements for dom0. The installer doesn't permit ZFS partitioning and pool creation, though, so templates and qubes are created in the stage2 install, as for BTRFS.)

I'd be interested in reading your thoughts on why BTRFS doesn't offer live deduplication, only the offline deduplication that bees permits.
Thanks!

@Zygo
Owner

Zygo commented Mar 15, 2025

A brief history of Linux dedupe implementations

| Deduper | Filesystem | Year | Type | Search method | Block size | Memory size | Overflow to disk |
|---|---|---|---|---|---|---|---|
| built-in | ZFS | 2008 | in-band | page write | large | large | yes (optional since 2024) |
| Permabit (VDO) | blockdev | 2009 (VDO in 2017) | in-band | block write | small | fixed | no |
| duperemove | btrfs / xfs / bcachefs | 2013 (bcachefs in 2017) | out-of-band | file tree walk + FIEMAP filter + file read | large | large | yes |
| dedupe out-of-tree kernel patch | btrfs | 2013-2019, never completed | in-band | page write | large | small | no |
| bees | btrfs | 2014 | out-of-band | fs-specific tree search + block read | small | fixed (sampling) | no |
| GETFSMAP experiments [1] | xfs | 2017-2021 | out-of-band | sequential device read | TBD | fixed (partitioning) | no |
| dduper [2] | btrfs | 2020 | out-of-band | file tree walk + csum tree read | large | large | yes |

  1. Dave Chinner proposed a deduper based on GETFSMAP and raw block device reads, which would perform out-of-band dedupe during a scrub operation. Last time I checked, this effort was sidelined because the XFS reflink implementation ends up being slower than every viable alternative except btrfs.
  2. A prototype was developed, but never finished. There have been a few projects like this, because people keep getting the same idea.

The ZFS approach does avoid duplicate data writes, and it's almost unique among dedupe strategies for that. It also has far higher memory and metadata IO requirements than any other dedupe strategy, all else being equal. The design is a naive content-addressable data store, despite enormous resources available to Sun's developers at the time, and active contemporary research in alternative dedupe strategies like VDO and reflink. The conventional guidance for ZFS users is to never use dedupe, especially after reflink support became available in ZFS, despite recent ZFS dedupe improvements.
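
To put rough numbers on the memory cost of a table that tracks every unique block, here is a back-of-envelope sketch. The 320 bytes per entry is an assumption (a commonly quoted rule of thumb for an in-core ZFS dedupe table entry, not a measurement), and `dedupe_table_bytes` is a hypothetical helper, not part of any tool:

```python
TIB = 1 << 40
GIB = 1 << 30

# Assumption: rule-of-thumb size of one in-memory dedupe table entry.
ENTRY_BYTES = 320


def dedupe_table_bytes(unique_data_bytes, block_size, entry_bytes=ENTRY_BYTES):
    """Estimate the size of a table holding one entry per unique block."""
    blocks = -(-unique_data_bytes // block_size)  # ceiling division
    return blocks * entry_bytes


# 10 TiB of unique data at two block sizes.
for block_size in (128 * 1024, 4 * 1024):
    table = dedupe_table_bytes(10 * TIB, block_size)
    print(f"{block_size // 1024} KiB blocks: {table / GIB:.0f} GiB of table")
# 128 KiB blocks need about 25 GiB of table; 4 KiB blocks need about 800 GiB.
```

The estimate only scales with the number of unique blocks, which is why a design like this pushes toward large block sizes, and why the fixed-size tables in the "Memory size" column above trade completeness for a bounded footprint.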

> your thoughts on why BTRFS doesn't offer live deduplication and only offline deduplication, which bees permits.

My guess is that in-band dedupe in btrfs simply isn't worth the effort:

  • ZFS-style dedupe has high memory and metadata IO costs that could be better spent on page cache, application data, or even extending SSD write lifespan (i.e. by not performing dedupe metadata operations on hybrid SSD/HDD setups).
  • If a user has unlimited budget for RAM, they can run ZFS today. There is no need to invest in development of another filesystem if the existing ZFS dedupe is acceptable. If the existing ZFS dedupe isn't acceptable, ZFS may not be the best starting point for dedupe alternatives.
  • In-band dedupe isn't necessarily a filesystem-specific feature--it could be implemented in the Linux VFS layer, using reflink and dedupe operations already supported by multiple filesystems. btrfs metadata performance isn't great compared to other filesystems with viable dedupe, like xfs or bcachefs. Those filesystems might be a better starting point than btrfs for anyone who wants to implement in-band Linux dedupe.
  • (Another edit) We are close to the performance ceiling of in-band dedupers, but we are nowhere near the performance ceiling on out-of-band dedupers. Dave Chinner's idea combines a full-filesystem dedupe with a scrub operation that users are likely running already for data integrity purposes, making the marginal cost of dedupe negligible. csum-based dedupers can scan hundreds of GiB per second of logical data on a middle-tier desktop. I estimate we can make out-of-band dedupers between 100x and 2000x faster than bees is today.

(edit) Oh, I almost forgot about VDO, the Other Inline Deduper.

VDO's design, grossly oversimplified, holds a hash table in memory which is used to locate and remap duplicate 4K blocks. VDO's main problem is that blindly deduping at 4K block size leads to massive fragmentation due to nuisance dedupes, which ultimately kills performance. Naively deduping a block device at greater than 4K block size leads to almost zero hit rates, because dedupe can only occur when a filesystem extent aligns with dedupe blocks at fixed locations.
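
The alignment effect can be illustrated with a toy model. This is a sketch of fixed-block hashing over a byte array, not VDO's actual implementation; the device size, extent size, offsets, and function names are all made up for the example. A 256 KiB extent is copied to a location that is 4K-aligned but not 64K-aligned, and duplicate blocks are counted at both block sizes:

```python
import hashlib
import random

KIB = 1024


def count_duplicate_blocks(device, block_size):
    """Count blocks whose content matches an earlier block on the same fixed grid."""
    seen = set()
    duplicates = 0
    for offset in range(0, len(device) - block_size + 1, block_size):
        digest = hashlib.sha256(device[offset:offset + block_size]).digest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates


def make_device(copy_offset, size=1024 * KIB, extent_len=256 * KIB, seed=0):
    """Random 'device' with the first extent_len bytes copied to copy_offset."""
    rng = random.Random(seed)
    device = bytearray(rng.getrandbits(8 * size).to_bytes(size, "little"))
    device[copy_offset:copy_offset + extent_len] = device[:extent_len]
    return bytes(device)


misaligned = make_device(copy_offset=512 * KIB + 4 * KIB)
aligned = make_device(copy_offset=512 * KIB)

print(count_duplicate_blocks(misaligned, 4 * KIB))   # 64: every 4K block of the copy matches
print(count_duplicate_blocks(misaligned, 64 * KIB))  # 0: the copy straddles the 64K grid
print(count_duplicate_blocks(aligned, 64 * KIB))     # 4: hits only when the copy lands on the grid
```

Since a filesystem places extents on its own block boundaries (typically 4K), a copy only lands on a coarser fixed grid by chance, which is where the near-zero hit rate comes from.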

The main complaint I used to get about bees from VDO users is that bees (up to v0.10) and VDO were too similar: because they both could dedupe on 4K boundaries, they both fragmented the filesystem to exhaustion. bees v0.11 fixes that by adding heuristics to determine whether bees should dedupe smaller extents. It looks like this could be fixed in VDO the same way, but for whatever reason, the maintainer hasn't chosen to do that yet (there are GitHub issues on kvdo about this, closed 6 years ago).
