
How is the affect on system performance? #305

Open
MarkusTieger opened this issue Mar 15, 2025 · 7 comments

Comments

@MarkusTieger

I haven't seen a single benchmark yet.

@kakra
Contributor

kakra commented Mar 16, 2025

While bees itself consumes some resources in the background (and it's usually not noticeable, except for the first full pass), I can at least confirm that since v0.11, bees does a very good job of reducing the latency of IO requests by coalescing extents that were previously fragmented into a lot of small extents. This is because bees now prefers the larger extents if multiple match.

So if you combine that with selective defragmentation of some files, it is able to keep the filesystem in good shape.

But IO latency on the desktop is very subjective and hard to measure, so I cannot provide any benchmarks either.

I'm pretty sure Zygo posted some diagrams in the past showing how effective bees is, which could probably count as a benchmark in a broader sense.

I wonder if there are diagrams about extent reduction or similar parameters.

Bees writes a lot of live statistics to a file each second. You could probably extract values (from this file and other system metrics) and create your own diagrams.
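That extraction can be sketched in a few lines. This is only a sketch: it assumes the statistics file holds one `name=value` integer counter per line, which is an assumption here; check the actual format and location used by your bees version.

```python
# Sketch: turn snapshots of bees's statistics file into per-interval deltas
# for plotting. Assumes one "name=value" integer counter per line; the real
# file format and location depend on your bees version and configuration.

def parse_stats(text):
    """Parse 'name=value' counter lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value.strip())
    return stats

def deltas(prev, curr):
    """Per-interval increase of each counter between two snapshots."""
    return {k: v - prev.get(k, 0) for k, v in curr.items()}
```

Sampling the file once per second and feeding the per-interval deltas into gnuplot or matplotlib would produce the kind of diagrams mentioned above.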

@Zygo
Owner

Zygo commented Mar 16, 2025

There seem to be 3 kinds of benchmarks:

  1. General IO and system latency benchmarks (bonnie, fio, IOZone, etc)
  2. In-band-dedupe-specific benchmarks like DEDISbench
  3. Benchmarks that measure space reduction (or consumption) over time for a standardized workload

Category 1 is not going to be very informative: the benchmark doesn't tell you about bees performance. Instead, the benchmark gives information about the performance of the rest of the system while bees is running. All things being equal, a system running bees will be slower than a system not running bees, and you can know that information without running any benchmark at all. The difference in performance depends on how aggressive bees is configured to be.

Category 2 doesn't work for an out-of-band deduper, even when run concurrently with the write load. DEDISbench reports that the system has similar performance to a system that is not using in-band deduplication (because bees doesn't do in-band deduplication), and DEDISbench isn't running at all when bees does out-of-band deduplication.

Category 3 is highly dependent on the topology of the data set (the size and locations of the duplicate blocks) and, especially on btrfs, on the filesystem-specific capabilities of the deduper. Category 3 results are not numbers, but curves. Out-of-band dedupers use different strategies to search for duplicate extents, and this can result in time intervals where the deduper is running but not releasing any space. A curve provides details such as:

  • After starting dedupe, when is the first free space expected?
  • What is the average rate of duplicate removal?
  • If the process is interrupted, what is the recovery time before space is freed again?
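Collecting such a curve doesn't require special tooling. A sampler along these lines records (elapsed seconds, bytes freed) pairs while a deduper runs; this is a sketch, and note that statvfs only approximates free space on btrfs (the path is a placeholder, and `btrfs filesystem usage` may be more accurate):

```python
import os
import time

def free_bytes(path):
    """Approximate free bytes on the filesystem containing path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def sample_space_freed(path, duration, interval=1.0):
    """Record (elapsed_seconds, bytes_freed_since_start) pairs for plotting."""
    t0 = time.monotonic()
    start = free_bytes(path)
    points = []
    while time.monotonic() - t0 < duration:
        # Negative values mean space was consumed (e.g. a concurrent rsync).
        points.append((time.monotonic() - t0, free_bytes(path) - start))
        time.sleep(interval)
    return points
```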

Here's a graph showing rsync and bees running at the same time (red line) vs rsync without bees (green line):

[graph image]

(note: since rsync is taking up space, the "space freed" is negative on this graph.)

Not much impact until there is some duplicate data, then the rsync slows down a little while bees runs behind it to clean up duplicates.


Here's a graph comparing DEDISbench's "ubuntu images" data set (red line) with some actual ubuntu images I downloaded (green line), both with bees running at the same time:

[graph image]

DEDISbench's synthetic workload is unrealistic because its "duplicate" blocks are all the same, and unrealistically compressible. Compressed blocks can be deduped, but we don't save much space for each one. Real data has longer extents so the savings are more visible.

This graph illustrates why the content of the test corpus matters when benchmarking a deduper. Statistically, according to some narrowly-defined metrics, the above two data sets are identical, but their real-world dedupe behavior is very different.

The flat area on the right side of the green line corresponds to some nearly identical image files, so although the download rate was constant, the free space did not substantially decrease during the download because of concurrent deduplication.


Here's a graph comparing 21 versions of the Ubuntu24 cloud image from the last year, with bees running at the same time as the download (red line) or after the download (green line):

[graph image]

(For consistency, the images were "downloaded" from a local SSD.)

There's not much space saved in this small, non-4K-aligned data set full of compressed files, but it does show that incremental dedupe-as-you-go does result in smaller peak data size and smaller final size--and that DEDISbench is unrealistic.


Here's a graph showing a 154 GiB data set (Windows raw VM image files) that I've been using for some time to evaluate bees development changes and compare with other out-of-band block-based dedupers on btrfs. At the moment there is only one of those that is still maintained (duperemove).

[graph image]

(Note: for consistency, all dedupers in this test are run starting with a pre-filled filesystem, so the "space freed" is positive in this graph.)

This graph shows that there is no amount of time T such that duperemove releases more space than bees after running T seconds. The graph also highlights bees's extent size sorting, which places the slow, unproductive tiny dedupes at the end of each dedupe cycle (the graph doesn't show that the cycle can be interrupted and restarted, and will seek out productive dedupes again, but this is already straying far from the original "benchmark" concept...)

@Zygo
Owner

Zygo commented Mar 17, 2025

Here's another graph comparing out-of-band dedupe strategies. The data set is a 3.5 GiB (7 GiB uncompressed) tree of files taken from a Debian root filesystem and copied once (i.e. every file has at least two full-file duplicates on the filesystem). This data set is designed to allow file-based dedupers to participate.

[graph image]

Observations:

  • File-based dedupers can use file metadata (particularly the file size) to quickly and cheaply eliminate large amounts of unique data and speed up their scanning phase. Only faster-dupemerge and rmlint seem to make use of this optimization.
  • A file-based deduper can also speed up the dedupe phase by using hardlinks instead of reflinks (if the application can tolerate hardlinks, i.e. the files are read-only). Some dedupers in this test support both modes, and they contribute two curves to the above graph.
  • rmlint uses the same file-size-sort that faster-dupemerge does, but everything rmlint does is about 3x slower than faster-dupemerge. I've omitted the rmlint curve from this graph because it has the same shape and height as the faster-dupemerge curve, but rmlint's 27-minute runtime would blow out the X axis scale.
  • jdupes in hardlink mode has the steepest slope, meaning it frees space the fastest between the first and last freed block--but it starts late, due to all the hashing and sorting required, so it comes in 3rd, seconds before faster-dupemerge catches up.
  • duperemove has extensive optimisations in its block-reading code, and uses multiple IO threads for scanning. These brute-force tactics allow it to be the fastest deduper for about 30 seconds of its 4-minute run time, between when it pulls ahead of bees and faster-dupemerge, and 40 seconds later when bees catches up.
  • jdupes does not have extensive optimisations in its file-reading code, and uses only one thread for IO. This slows down the scanning phase so much that two other dedupers have completely finished before jdupes frees its first byte.
  • faster-dupemerge and bees both sort dedupable objects by size during processing, so they free more space on the left side of the graph (i.e. sooner). In the time it takes duperemove to free its first byte, bees and faster-dupemerge have freed about 1.6 GiB, about 30% of the total dedupe available.
  • duperemove's fdupes mode and jdupes -B (reflink) mode both suffer from lack of explicit preread and efficient reading in the btrfs dedupe ioctl. This pushes them toward the right side of the graph compared to bees and duperemove (which preread data into the page cache explicitly), and all the hardlinking dedupers (which use normal read/link/unlink/rename operations, not a special dedupe ioctl). The btrfs dedupe ioctl was noticeably faster on older kernels, before btrfs's dedupe call was replaced with copy_file_range.
  • dduper is on this graph, but it didn't free any space in the first 90 minutes. As I write this, I'm still waiting for dduper to free its first byte. The concept of dduper is brilliant: in theory, a csum-based deduper should be capable of scanning this entire test filesystem in a few dozen milliseconds. Someone needs to implement that concept in a functional deduper.
  • bees is the only deduper on this graph capable of deduping extents with mixed unique and duplicate data, so it gets a little more space than everything else, even in this tiny data set.
  • The two copies of every file are not quite equal:
# compsize -b t1
Processed 215467 files, 150521 regular extents (150718 refs), 94996 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       49%     3666380906   7390798669   7413764941
none       100%     1135365016   1135365016   1138502552
zlib        40%     2531015890   6255433653   6275262389      <-- 40.461% compression
# compsize -b t2
Processed 215467 files, 150505 regular extents (150702 refs), 94996 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       49%     3667347562   7390798669   7413764941
none       100%     1137589144   1137589144   1140726680
zlib        40%     2529758418   6253209525   6273038261     <-- 40.455% compression

This difference in compressed data size results in slightly different total amounts of data deduped between the file-oriented dedupers, depending on which of the duplicated files they chose to keep. That effect was not intentional, and it's a good example of the challenges that arise when trying to benchmark dedupers.
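As a quick arithmetic check, the compression percentages quoted above follow directly from the zlib rows' Disk Usage and Uncompressed columns:

```python
# Recompute the zlib compression ratios from the compsize output above.
t1_ratio = 2531015890 / 6255433653  # t1: zlib disk usage / uncompressed
t2_ratio = 2529758418 / 6253209525  # t2: zlib disk usage / uncompressed

print(f"t1: {t1_ratio:.3%}")  # 40.461%
print(f"t2: {t2_ratio:.3%}")  # 40.455%
```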

@kakra
Contributor

kakra commented Mar 17, 2025

These are all very nice diagrams which show the performance (both effectiveness and speed) of bees compared to other dedupers.

What I get from these graphs is:

  • I probably should expect that bees has higher impact on system performance for a shorter time
  • bees is probably mostly idle afterwards and is able to catch up fast on new data

However, it doesn't tell me what impact the result of deduplication has on the system long-term. I'm not sure what the intention of the original question was, but maybe it's about these three concerns:

  • How does bees impact system performance while it works?
  • How does bees impact system performance while it is idle?
  • How does bees impact system performance long-term?

These are mostly subjective and probably hard to measure.

We got one observation so far: bees compares very well to other dedupers in both speed and effectiveness.

From my own observations I can confirm that bees is quite hard on system performance during the first pass through the dataset. This is probably mostly irrelevant for servers, but it has a very noticeable effect on interactive desktop performance. For me, that process took 2-4 weeks.

From another observation I can confirm that bees is very effective: For a full restore of a failed server, we installed bees first before restoring the data (we are running a thin base OS, everything else is containers and data storage). The restored raw size of the storage is bigger than the physical space available. Bees was able to deduplicate the data during restore down to a size of around 50% of the physical total space. This has been very impressive. Restore took around 24 hours. The dataset contained 1+ TB of mailboxes and various web sites with multiple instances of different CMS and CMS versions.

From my personal subjective perspective, I can confirm that overall system behavior with bees 0.11 is much better than with older versions. To fully get the benefits of bees 0.11, I defragmented most of my data once (to make use of its tendency to prefer larger extents). But during the process, desktop performance was heavily impacted.

Maybe a compsize of the results and some other metrics could give some more insights:

  • How many refs have been created?
  • How big are extents on average? (maybe different percentiles)
  • Desktops: Do we see a difference in application startup times if simulating a desktop? (I think Phoronix Suite can do some of those)
  • VM storages: How fast does a VM boot before and after?
  • File servers: How fast can you copy a file to and from a storage before and after?
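For the file-server case, a before/after timing stays comparable only if the page cache doesn't favor one run. A Linux-only sketch (the path would be a real file on the storage under test; average several runs to smooth out variance):

```python
import os
import time

def cold_read_seconds(path, bufsize=1 << 20):
    """Time a sequential read of path with its cached pages evicted first."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Drop cached pages for this file so the read hits the disk (Linux-only).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        t0 = time.monotonic()
        while os.read(fd, bufsize):
            pass
        return time.monotonic() - t0
    finally:
        os.close(fd)
```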

What do you think? Could this make for interesting results? The OP's question is very terse...

@Zygo
Owner

Zygo commented Mar 17, 2025

How many refs have been created?

At scale, this is proportional to the data size (a proxy for how many extents are removed) and metadata size (a proxy for how many duplicate refs have been created).

How big are extents on average? (maybe different percentiles)

That's very data-dependent, to the point where a server's application workload can be inferred from its distribution of extent sizes. Before/after ratios and curves could be useful, but any effect from bees would be mixed in with general filesystem aging effects if the measurements were collected over a long period of time.
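If extent lengths are collected (for example by parsing `filefrag -v` output, whose exact format varies by version, so the parsing step is omitted here), the percentile computation itself is small. A sketch:

```python
from statistics import quantiles

def extent_percentiles(lengths, points=(0.5, 0.9, 0.99)):
    """Return {fraction: extent length} at the requested percentile points."""
    qs = quantiles(sorted(lengths), n=100, method="inclusive")
    return {p: qs[round(p * 100) - 1] for p in points}
```

Comparing the before/after distributions at a few percentiles would show whether bees shifted the extent-size mix, subject to the aging caveat above.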

Desktops: Do we see a difference in application startup times if simulating a desktop? (I think Phoronix Suite can do some of those)
VM storages: How fast does a VM boot before and after?
File servers: How fast can you copy a file to and from a storage before and after?

btrfs is highly variable in all of these metrics after operations like balance and snapshot delete. bees is another metadata-intensive operation that will generate big random swings in btrfs performance, similar to the aftermath of a snapshot delete but spread out over more time.

I can add a few key metrics:

  • Sequential read performance before/after. This is a general dedupe issue, when logically adjacent extents cease to be physically adjacent. This measurement needs a standardised corpus (specifying both the data and the IO pattern) to work, as every data set will be different.
  • Allocation (write) performance before/after. Deduplication creates a lot of free space holes, which have to be filled by fragments or balanced away. Both tactics have different performance curves.
  • Transaction latency during dedupe (latency after dedupe shows up in allocation performance data). bees has some effect here, but not as large as a continuous reader thread, a snapshot delete, or a balance.
  • Kernel scheduling latency. LOGICAL_INO is slow, CPU intensive, and runs in a kernel thread. Would it be better to implement that in userspace instead, or improve the kernel implementation?

but there are confounding factors:

  • Result interpretation can be goal-dependent. e.g. the goal of high sequential read performance can conflict with the goal of extent sharing. bees picked a point in between good dedupe and good sequential read performance. If that point becomes the subject of benchmarks, then it should be configurable, and that multiplies the number of configurations to benchmark.
  • Comparison of some results needs a standardised data corpus to work, as every user's data set will be different. DEDISbench is a good start, but it needs to be finished.
  • Measurement on SSDs is problematic due to variances arising from SSD aging and write multiplication. Running an identical set of operations can produce results with a 3:1 ratio between highest and lowest execution times on good drives, and 100:1 on bad ones.
  • Measurement on HDDs is more stable, but produces results irrelevant to SSD users.
  • Measurement on RAM is most stable, but produces results irrelevant to users with data sets too large to fit in RAM.
  • Measurement is expensive. It can take months on a big data set before some of the long-term effects become apparent on btrfs, and a second instance of the same data set is required for calibration (i.e. would the effect have occurred anyway, without bees running?).
  • A small code tweak in bees can instantly make any data collection in progress obsolete.
  • Other effects in btrfs can dominate the results. We might simulate filesystem aging to see if bees makes it happen faster, and by what factor. We can only compare that to a filesystem with an identical workload of the same age.

The OP's question is very terse...

Indeed. Despite the opportunity to explore the state of the art in dedupe benchmarking, the OP's request is not feasible because it does not specify any constraints on the response.

I invite OP, or anyone else interested in this question, to:

  1. Pick a use case
  2. Determine which benchmark(s) are relevant to that use case
  3. Run the benchmark(s)
  4. Post here:
    a. the rationale for the benchmark's relevance
    b. documentation of the methodology
    c. the results

@Zygo
Owner

Zygo commented Mar 21, 2025

Added fclones...

[graph image]

Not the result I was expecting! fclones in hardlink mode tied for 3rd place. The reflink mode looks fast, but it uses the unsafe FICLONE ioctl, so it's not directly comparable to duperemove or jdupes -B.

Normally I would exclude dedupers using FICLONE from bake-offs, but FICLONE does provide an upper bound on FIDEDUPERANGE performance. Assuming no other changes, if fclones adopts FIDEDUPERANGE, the line can't move to the left of where it is with FICLONE--but it would become safe to use fclones with live data.
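The distinction between the two ioctls is visible even in a minimal wrapper: FICLONE reflinks the source unconditionally, while FIDEDUPERANGE makes the kernel compare the byte ranges first and only shares them if identical, so it cannot corrupt data that changed between scan and dedupe. A sketch of the unsafe FICLONE path (the ioctl number is the standard Linux value; a reflink-capable filesystem such as btrfs is required to run it):

```python
import fcntl
import os

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from linux/fs.h

def reflink(src_path, dst_path):
    """Clone src into dst with no content check (unsafe if src is changing)."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        fcntl.ioctl(dst, FICLONE, src)
    finally:
        os.close(src)
        os.close(dst)
```

FIDEDUPERANGE would be driven the same way via fcntl.ioctl, but it takes a packed struct file_dedupe_range describing the source range and destination descriptors, which is why the kernel can verify the contents before sharing.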

The slope of the graph after fclones finishes its scan phase is amazing--almost vertical. If we count only the time fclones dedupe spends, it's faster than anything else--but the scan phase in fclones group takes so long that other dedupers have almost finished or already finished running.

@kakra
Contributor

kakra commented Mar 21, 2025

I think part of the reason is: FICLONE just clones and doesn't check for identical contents. I'd say fclones is a pretty dangerous tool unless you can ensure that none of the proposed file duplicates changed between sampling and application.
