What is the effect on system performance? #305
Comments
While bees itself consumes some resources in the background (usually not noticeable, except during the first full pass), I can at least confirm that since v0.11, bees does a very good job of reducing the latency of IO requests by coalescing extents that were previously fragmented into many small extents. This is because bees now prefers the larger extents when multiple extents match. So if you combine that with selective defragmentation of some files, it is able to keep the filesystem in good shape. But IO latency on the desktop is very subjective and hard to measure, so I cannot provide any benchmarks either. I'm pretty sure Zygo posted some diagrams in the past showing how effective bees is, which could probably count as a benchmark in a broader sense. I wonder if there are diagrams about extent reduction or similar parameters. Bees writes a lot of live statistics to a file each second. You could probably extract values (from this file and other system metrics) and create your own diagrams.
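A minimal sketch of such an extractor, assuming the stats live in `$BEESHOME/beesstats.txt` and are written as `name=value` tokens (both the path and the format are assumptions; check your installation):

```python
import os
import time

# Assumed location; bees keeps its state under $BEESHOME (often a
# .beeshome directory on the mounted filesystem). Adjust to your setup.
STATS_FILE = os.path.expandvars("$BEESHOME/beesstats.txt")

def read_stats(path=STATS_FILE):
    """Parse integer name=value tokens from the bees stats file."""
    stats = {}
    with open(path) as f:
        for token in f.read().split():
            name, sep, value = token.partition("=")
            if sep and value.isdigit():
                stats[name] = int(value)
    return stats

# Print per-second counter deltas; feed the output into your favorite
# plotting tool. Stop with Ctrl-C.
prev = read_stats()
while True:
    time.sleep(1)
    cur = read_stats()
    deltas = {k: cur[k] - prev.get(k, 0)
              for k in cur if cur[k] != prev.get(k, 0)}
    if deltas:
        print(int(time.time()), deltas)
    prev = cur
```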
There seem to be 3 kinds of benchmarks:

1. Generic system benchmarks run while bees is active in the background.
2. In-band dedupe benchmarks such as DEDISbench, run concurrently with the write load.
3. Measurements of the space a deduper frees on a given data set over time.
Category 1 is not going to be very informative: the benchmark doesn't tell you about bees performance. Instead, it gives information about the performance of the rest of the system while bees is running. All things being equal, a system running bees will be slower than a system not running bees, and you know that without running any benchmark at all. The size of the difference depends on how aggressively bees is configured.

Category 2 doesn't work for an out-of-band deduper, even when run concurrently with the write load. DEDISbench reports that the system has similar performance to a system that is not using in-band deduplication (because bees doesn't do in-band deduplication), and DEDISbench isn't running at all when bees does out-of-band deduplication.

Category 3 is highly dependent on the data set: on the topology of the data set (size and locations of the duplicate blocks) and, especially on btrfs, on the filesystem-specific capabilities of the deduper. Category 3 results are not numbers, but curves. Out-of-band dedupers use different strategies to search for duplicate extents, which can result in time intervals where the deduper is running but not releasing any space. A curve provides details such as how long the deduper runs before it releases any space, how quickly space is released, and how much space is recovered in total.
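Collecting the raw data for such a curve is straightforward. Here is a minimal sampler sketch; note that free-space numbers on btrfs are only approximate, and the mount point and output file name are hypothetical:

```python
import csv
import os
import time

def sample_free_space(path, out_csv, interval=1.0):
    """Log free space on the filesystem containing `path` once per
    `interval` seconds; the log is the raw data for a 'space freed'
    curve. Stop with Ctrl-C."""
    start = time.monotonic()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["seconds", "free_bytes"])
        while True:
            st = os.statvfs(path)
            writer.writerow([round(time.monotonic() - start, 1),
                             st.f_bavail * st.f_frsize])
            f.flush()
            time.sleep(interval)

sample_free_space("/mnt/testfs", "with_bees.csv")
```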
Here's a graph showing rsync and bees running at the same time (red line) vs rsync without bees (green line): (note: since rsync is taking up space, the "space freed" is negative on this graph.) Not much impact until there is some duplicate data; then the rsync slows down a little while bees runs behind it to clean up duplicates.

Here's a graph comparing DEDISbench's "ubuntu images" data set (red line) with some actual ubuntu images I downloaded (green line), both with bees running at the same time: DEDISbench's synthetic workload is unrealistic because its "duplicate" blocks are all the same, and unrealistically compressible. Compressed blocks can be deduped, but we don't save much space for each one. Real data has longer extents, so the savings are more visible. This graph illustrates why the content of the test corpus matters when benchmarking a deduper. Statistically, according to some narrowly defined metrics, the above two data sets are identical, but their real-world dedupe behavior is very different. The flat area on the right side of the green line corresponds to some nearly identical image files: although the download rate was constant, free space did not substantially decrease during the download because of concurrent deduplication.

Here's a graph comparing 21 versions of the Ubuntu24 cloud image from the last year, with bees running at the same time as the download (red line) or after the download (green line): (For consistency, the images were "downloaded" from a local SSD.) There's not much space saved in this small, non-4K-aligned data set full of compressed files, but it does show that incremental dedupe-as-you-go results in a smaller peak data size and a smaller final size--and that DEDISbench is unrealistic.

Here's a graph showing a 154 GiB data set (Windows raw VM image files) that I've been using for some time to evaluate bees development changes and compare with other out-of-band block-based dedupers on btrfs. At the moment only one of those is still maintained (duperemove). (Note: for consistency, all dedupers in this test are run starting with a pre-filled filesystem, so the "space freed" is positive in this graph.) This graph shows that there is no amount of time T such that duperemove releases more space than bees after running T seconds. The graph also highlights bees's extent size sorting, which places the slow, unproductive tiny dedupes at the end of each dedupe cycle. (The graph also doesn't show that the cycle can be interrupted and restarted, and will seek out productive dedupes again, but this is already straying far from the original "benchmark" concept...)
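For anyone who wants to draw similar curves from the sampler above, a minimal plotting sketch (the CSV file names are hypothetical):

```python
import csv

import matplotlib.pyplot as plt

def plot_space_freed(csv_path, label):
    """Turn a free-space log into a 'space freed' curve: positive while
    the deduper releases space, negative while new data is written."""
    with open(csv_path) as f:
        rows = list(csv.DictReader(f))
    baseline = int(rows[0]["free_bytes"])
    seconds = [float(row["seconds"]) for row in rows]
    freed = [(int(row["free_bytes"]) - baseline) / 2**30 for row in rows]
    plt.plot(seconds, freed, label=label)

plot_space_freed("with_bees.csv", "rsync + bees")
plot_space_freed("without_bees.csv", "rsync only")
plt.xlabel("seconds")
plt.ylabel("space freed (GiB)")
plt.legend()
plt.show()
```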
Here's another graph comparing out-of-band dedupe strategies. The data set is a 3.5 GiB (7 GiB uncompressed) tree of files taken from a Debian root filesystem and copied once (i.e. every file exists in at least two full copies on the filesystem). This data set is designed to allow file-based dedupers to participate. Observations:
This difference in compressed data size (two copies of the same file do not necessarily occupy the same compressed size on disk) results in slightly different total amounts of data deduped between the file-oriented dedupers, depending on which of the duplicated files they chose to keep. That effect was not intentional, and it's a good example of the challenges that arise when trying to benchmark dedupers.
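Such a corpus is easy to reproduce: copy a tree once, so every file has a whole-file duplicate for file-based dedupers to find. A sketch (the source path is hypothetical):

```python
import shutil

# Hypothetical source tree, e.g. files taken from a Debian root filesystem.
SRC = "/srv/testdata/debian-root"

# One copy is enough: afterwards every file exists (at least) twice,
# which is the minimum a file-based deduper needs to participate.
shutil.copytree(SRC, SRC + ".copy", symlinks=True)
```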
These are all very nice diagrams which show the performance (both effectiveness and speed) of bees compared to other dedupers. What I get from these graphs is:

- bees compares very well to other dedupers, in both speed and effectiveness.
However, it doesn't tell me what impact the result of deduplication has on the system long-term. I'm not sure what the intention of the original question was, but maybe it's about these three concerns:

1. How many resources bees itself consumes while scanning and deduplicating.
2. How much space the deduplication actually saves.
3. What long-term effect the deduplicated extent layout has on filesystem performance.
These are mostly subjective and probably hard to measure. We got one observation so far: bees compares very well to other dedupers in both speed and effectiveness.

From my own observations I can confirm that bees is quite hard on system performance during the first pass through the data set. This is probably mostly irrelevant for servers, but it has a very noticeable effect on interactive desktop performance. For me, that process took 2-4 weeks.

From another observation I can confirm that bees is very effective: for a full restore of a failed server, we installed bees before restoring the data (we run a thin base OS; everything else is containers and data storage). The restored raw size of the storage is bigger than the physical space available. Bees was able to deduplicate the data during the restore down to around 50% of the total physical space, which was very impressive. The restore took around 24 hours. The data set contained 1+ TB of mailboxes and various web sites with multiple instances of different CMSes and CMS versions.

From my personal, subjective perspective, I can confirm that overall system behavior with bees 0.11 is much better than with older versions. To fully get the benefits of bees 0.11, I defragmented most of my data once (to make use of its tendency to prefer larger extents), but during that process, desktop performance was heavily impacted. Maybe comparing compsize output of the results, along with some other metrics, could give some more insight; a sketch of such a comparison follows.
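A minimal sketch, assuming compsize's human-readable table layout (the mount point is hypothetical, and compsize usually needs root):

```python
import subprocess

def compsize_totals(path):
    """Run compsize and return the fields of its TOTAL line
    (percentage, disk usage, uncompressed, referenced)."""
    out = subprocess.run(["compsize", "-x", path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("TOTAL"):
            return line.split()
    raise RuntimeError("no TOTAL line in compsize output")

before = compsize_totals("/mnt/data")
# ... defragment and/or let bees finish a pass here ...
after = compsize_totals("/mnt/data")
print("before:", before)
print("after: ", after)
```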
What do you think? Could this produce interesting results? The OP's question is very terse...
At scale, this is proportional to the data size (a proxy for how many extents are removed) and metadata size (a proxy for how many duplicate refs have been created).
That's very data-dependent, to the point where a server's application workload can be inferred from its distribution of extent sizes. Before/after ratios and curves could be useful, but any effect from bees would be mixed in with general filesystem aging effects if the measurements were collected over a long period of time.
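One way to collect such an extent-size distribution is to parse `filefrag -v` output (from e2fsprogs); a sketch, with the caveat that the column layout of its output is an assumption to verify against your version, and that lengths are in filesystem blocks:

```python
import collections
import os
import subprocess
import sys

def extent_lengths(path):
    """Yield per-extent lengths (in filesystem blocks) from `filefrag -v`.
    Assumed row layout: ext, logical range, physical range, length, ..."""
    out = subprocess.run(["filefrag", "-v", path],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        cols = line.split()
        if len(cols) >= 6 and cols[0].rstrip(":").isdigit():
            yield int(cols[5].rstrip(":"))

# Histogram of extent sizes, bucketed by power of two.
hist = collections.Counter()
for dirpath, _, names in os.walk(sys.argv[1]):
    for name in names:
        for length in extent_lengths(os.path.join(dirpath, name)):
            hist[length.bit_length() - 1] += 1

for bucket in sorted(hist):
    print(f"2^{bucket} blocks: {hist[bucket]}")
```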
btrfs is highly variable in all of these metrics after operations like balance and snapshot delete. bees is another metadata-intensive operation that will generate big random swings in btrfs performance, similar to the aftermath of a snapshot delete but spread out over more time. I can add a few key metrics:
but there are confounding factors:
Indeed. Despite the opportunity to explore the state of the art in dedupe benchmarking, the OP's request is not feasible because it does not specify any constraints on the response. I invite OP, or anyone else interested in this question, to:
Added fclones... Not the result I was expecting! fclones in hardlink mode tied for 3rd place. The reflink mode looks fast, but it uses the unsafe clone ioctl (FICLONE) instead of the dedupe ioctl (FIDEDUPERANGE). Normally I would exclude dedupers using FICLONE: clone replaces the destination range without verifying that it still matches the source, while dedupe locks both ranges and compares them before sharing anything, so a concurrent write cannot cause data loss. The slope of the graph after fclones finishes its scan phase is amazing--almost vertical. If we count only the time fclones spends in its dedupe phase, it would place much higher.
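For reference, a minimal sketch of calling the safe dedupe ioctl from Python (struct layout per Linux's struct file_dedupe_range; the file names are hypothetical):

```python
import fcntl
import struct

FIDEDUPERANGE = 0xC0189436   # _IOWR(0x94, 54, struct file_dedupe_range)
FILE_DEDUPE_RANGE_SAME = 0   # status: ranges were identical and are now shared

def dedupe_range(src, src_off, length, dst, dst_off):
    """Ask the kernel to share `length` bytes between src and dst.
    Unlike FICLONE, the kernel locks and compares both ranges first,
    so concurrent writers cannot cause data loss."""
    # struct file_dedupe_range: src_offset, src_length, dest_count, reserved1, reserved2
    hdr = struct.pack("=QQHHI", src_off, length, 1, 0, 0)
    # struct file_dedupe_range_info: dest_fd, dest_offset, bytes_deduped, status, reserved
    info = struct.pack("=qQQiI", dst.fileno(), dst_off, 0, 0, 0)
    buf = bytearray(hdr + info)
    fcntl.ioctl(src.fileno(), FIDEDUPERANGE, buf)
    _, _, bytes_deduped, status, _ = struct.unpack_from("=qQQiI", buf, len(hdr))
    return bytes_deduped, status == FILE_DEDUPE_RANGE_SAME

# Usage sketch: try to dedupe the first 128 KiB of two hopefully identical files.
with open("a.img", "rb") as src, open("b.img", "r+b") as dst:
    deduped, same = dedupe_range(src, 0, 128 * 1024, dst, 0)
    print("bytes deduped:", deduped, "ranges identical:", same)
```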
I think part of the reason is:
I haven't seen a single benchmark yet.