DedupBench is a benchmarking tool for data chunking techniques used in data deduplication. DedupBench is designed for extensibility, allowing new chunking techniques to be implemented with minimal additional code. DedupBench is also designed to be used with generic datasets, allowing for the comparison of a large number of data chunking techniques.
1
+

3
2
4
-
DedupBench currently supports many state-of-the-art data chunking and hashing algorithms. Please cite the relevant publications from this list if you use the code from this repository:
3
+
<h2><p align="center">Benchmarking Chunking Techniques for Data Deduplication</p></h2>
5
4
5
+
<h3><p align="center">
6
+
<a href="#-quick-start-guide"> 🚀 Quick Start</a> | <a href="#news">⭐News</a> | <a href="#-research-papers"> 🔖 Research Papers </a> | <a href="https://www.kaggle.com/datasets/sreeharshau/vm-deb-fast25"> 💾 VM Images Dataset </a> | <a href="#faq">❓FAQ </a> | <a href="#contact"> 💂‍♂️ People </a>
7
+
</p></h3>
8
+
9
+
# 🎉 Introduction
10
+
11
+
DedupBench is a benchmarking tool for data chunking techniques used in data deduplication. It is designed for extensibility, allowing new chunking and fingerprinting techniques to be implemented with minimal additional code. DedupBench is designed to be used with any dataset, allowing for the quick comparison of a large number of chunking techniques on user-specified data.
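For readers new to content-defined chunking (CDC), the sketch below shows the general idea behind a Gear-style chunker in Python. It is only an illustration and not DedupBench's implementation: the table seed, the `gear_chunks` name, and the default size parameters are assumptions chosen for this example.

```python
import random

# Hypothetical Gear table: one pseudo-random 64-bit value per byte value.
# A fixed seed keeps chunk boundaries reproducible across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def gear_chunks(data, avg_bits=13, min_size=2048, max_size=65536):
    """Split `data` (bytes) into content-defined chunks.

    A boundary is declared when the top `avg_bits` bits of the rolling
    hash are all zero, which happens with probability 2**-avg_bits per
    byte once `min_size` bytes have been consumed.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear rolling hash: shift out old bytes, mix in the new one.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        if (h >> (64 - avg_bits)) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    # Whatever is left after the last boundary forms the final chunk.
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, inserting data near the start of a file tends to shift only nearby chunk boundaries, which is what makes CDC useful for deduplication.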
12
+
13
+
It currently supports eleven different chunking algorithms and six different fingerprinting algorithms. It supports SIMD acceleration for these algorithms with five different vector instruction sets on Intel, AMD, ARM, and IBM CPUs.
14
+
15
+
The following chunking techniques and SIMD accelerations are currently supported by DedupBench.
-*Aug. 2025*: We have released DedupBench v2.0 with ARM / IBM vector acceleration support, xxHash compatibility and much more!
33
+
-*Feb. 2025*: VectorCDC has been published in [FAST](https://www.usenix.org/conference/fast25/presentation/udayashankar)!
34
+
-*Jan. 2025*: We have released the [DEB dataset](https://www.kaggle.com/datasets/sreeharshau/vm-deb-fast25) on Kaggle
35
+
36
+
37
+
# 🚀 Quick start guide
38
+
To quickly get started, run the following commands on Ubuntu:
39
+
40
+
1. Clone repository and create a basic build without SIMD acceleration.
6
41
```
7
-
[1] Udayashankar, S., Baba, A. and Al-Kiswany, S., 2025, February. VectorCDC: Accelerating Data Deduplication with SSE/AVX Instructions. In 2025 USENIX 23rd Conference on File and Storage Technologies (FAST). USENIX
8
-
[2] Udayashankar, S., Baba, A. and Al-Kiswany, S., 2024, December. SeqCDC: Hashless Content-Defined Chunking for Data Deduplication. In 2024 ACM/IFIP 25th International Middleware Conference (MIDDLEWARE). ACM
9
-
[3] Jarah, MA., Udayashankar, S., Baba, A. and Al-Kiswany, S., 2024, July. The Impact of Low-Entropy on Chunking Techniques for Data Deduplication. In 2024 IEEE 17th International Conference on Cloud Computing (CLOUD) (pp. 134-140). IEEE.
10
-
[4] Liu, A., Baba, A., Udayashankar, S. and Al-Kiswany, S., 2023, September. DedupBench: A Benchmarking Tool for Data Chunking Techniques. In 2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) (pp. 469-474). IEEE.
This should generate a graph named _results_graph.png_ similar to the one below. Note that the space savings will be zero for all algorithms, as the default run uses the random dataset, which contains no duplicate data.
56
+
57
+
To see a real dataset in action and generate the graph below, download and use the _DEB_ dataset used in our Middleware 2024 / FAST 2025 papers from [💾 VM Images Dataset](https://www.kaggle.com/datasets/sreeharshau/vm-deb-fast25). This graph is from an AMD EPYC Rome machine.
To use any of the vector-accelerated CDC algorithms, an alternative DedupBench build is required. We have provided preconfigured files for all algorithms with 8KB chunk sizes for convenience.
65
+
66
+
**_Note that building with the wrong options (such as AVX-256 on a machine without AVX-256 support) may result in compile / runtime errors._**
67
+
68
+
### 🔥 SSE/AVX-256 Acceleration
69
+
This build needs an AVX-256 compatible CPU to work correctly.
3. If AVX-512 support is required, these are the alternative build commands. **_Note that building with this option on a machine without AVX-512 support will result in runtime errors._**
78
+
### 🌀 AVX-512 Acceleration
79
+
This build needs an AVX-512 compatible CPU to work correctly.
31
80
```
81
+
cd build/
32
82
make clean
33
-
make EXTRA_COMPILER_FLAGS='-mavx512f -mavx512vl -mavx512bw'
34
-
```
35
-
4. Generate a dataset consisting of random data for testing. This generates three 1GB files with random ASCII characters on Ubuntu 22.04.
36
-
```
37
-
mkdir random_dataset
38
-
cd random_dataset/
39
-
base64 /dev/urandom | head -c 1000000000 > random_1.txt
40
-
base64 /dev/urandom | head -c 1000000000 > random_2.txt
41
-
base64 /dev/urandom | head -c 1000000000 > random_3.txt
42
-
```
43
-
Alternatively, download and use the _DEB_ dataset used in our Middleware 2024 / FAST 2025 papers from [Kaggle](https://www.kaggle.com/datasets/sreeharshau/vm-deb-fast25).
83
+
make simd512_all
44
84
45
-
# Running dedup-bench
46
-
This section describes how to run dedup-bench. You can run dedup-bench using our preconfigured scripts for 8KB chunks or manually if you want custom techniques/chunk sizes.
85
+
./dedup_script.sh -c simd512_8kb random_dataset
86
+
python3 plot_results.py results.txt
87
+
```
47
88
48
-
## Preconfigured Run - 8 KB chunks
49
-
We have created scripts to run dedup-bench with an 8KB average chunk size on any given dataset. These commands run all the CDC techniques shown in the VectorCDC paper from FAST 2025.
50
-
1. Go into the dedup-bench build directory.
89
+
## 🚴 Basic Unaccelerated Build
90
+
This unaccelerated build should work on all machines regardless of CPU capabilities.
Note that we have not provided configuration file examples for these. Please refer to the custom runs section in the [❓FAQ](#faq).
102
+
#### ARM with NEON-128 instructions
103
+
```
104
+
cd build/
105
+
make clean
106
+
make arm_neon128
107
+
```
108
+
#### IBM with VSX-128 / AltiVec instructions
53
109
```
54
-
2. Run dedup-script with your chosen dataset. Replace `<path_to_dataset>` with the directory of the random dataset you previously created / any other dataset of your choice. **_Note that VRAM-512 will not run when compiled without AVX-512 support_**.
110
+
cd build/
111
+
make clean
112
+
make ibm_altivec128
55
113
```
56
-
./dedup_script.sh -t 8kb_fast25 <path_to_dataset>
114
+
115
+
# 🔖 Research Papers
116
+
117
+
Please cite the relevant publications from this list if you use the code from this repository:
118
+
119
+
### Vectorized algorithms / DEB dataset
57
120
```
58
-
3. Plot a graph with the throughput results from all CDC algorithms (including VRAM) on your dataset. The graph is saved in `results_graph.png`.
121
+
[1] Udayashankar, S., Baba, A. and Al-Kiswany, S., 2025, February. VectorCDC: Accelerating Data Deduplication with SSE/AVX Instructions. In 2025 USENIX 23rd Conference on File and Storage Technologies (FAST). USENIX
59
122
```
60
-
python3 plot_throughput_graph.py results.txt
123
+
### SeqCDC
124
+
```
125
+
[2] Udayashankar, S., Baba, A. and Al-Kiswany, S., 2024, December. SeqCDC: Hashless Content-Defined Chunking for Data Deduplication. In 2024 ACM/IFIP 25th International Middleware Conference (MIDDLEWARE). ACM
126
+
```
127
+
### Low Entropy Analysis
128
+
```
129
+
[3] Jarah, MA., Udayashankar, S., Baba, A. and Al-Kiswany, S., 2024, July. The Impact of Low-Entropy on Chunking Techniques for Data Deduplication. In 2024 IEEE 17th International Conference on Cloud Computing (CLOUD) (pp. 134-140). IEEE.
130
+
```
131
+
### DedupBench Original Paper
132
+
```
133
+
[4] Liu, A., Baba, A., Udayashankar, S. and Al-Kiswany, S., 2023, September. DedupBench: A Benchmarking Tool for Data Chunking Techniques. In 2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) (pp. 469-474). IEEE.
## How do I run the experiments from your Paper X?
147
+
We provide configuration files in `build/` for the experiments from our papers. Run `dedup_script.sh` with the configuration that matches the paper.
148
+
```
149
+
./dedup_script.sh -c 8kb_fast25 <path-to-dataset>
150
+
python3 plot_results.py results.txt
61
151
```
62
152
63
153
64
-
## Manual Runs - Custom techniques/chunk sizes
65
-
1. Choose the required chunking, hashing techniques, and chunk sizes by modifying `config.txt`. The default configuration runs SeqCDC with an average chunk size of 8 KB. Supported parameter values are given in the next section and sample config files are available in `build/config_8kb_fast25/`.
154
+
## How do I run custom techniques / chunk sizes?
155
+
1. Choose the required chunking, hashing techniques, and chunk sizes by modifying `config.txt`.
66
156
```
67
-
cd <dedup_bench_repo_dir>/build/
157
+
cd build/
68
158
vim config.txt
69
159
```
70
-
2. Run dedup-bench. Note that the path to be passed is a directory and that the output is generated in a file `hash.out`. Throughput and avg chunk size are printed to stdout.
160
+
2. Run the dedup-bench binary directly. The path passed must be a **directory containing all the dataset files**. The output is written to a file named `hash.out`, and the throughput and average chunk size are printed to stdout.
@@ -76,16 +166,16 @@ We have created scripts to run dedup-bench with an 8KB average chunk size on any
76
166
./measure-dedup.exe hash.out
77
167
```
78
168
79
-
# Supported Chunking and Hashing Techniques
169
+
## How do I modify config.txt for custom runs?
80
170
81
-
Here are some hints using which `config.txt` can be modified.
171
+
### Chunking techniques (CDC algorithms)
82
172
83
-
### Chunking Techniques
84
-
The following chunking techniques are currently supported by DedupBench. Note that the `chunking_algo` parameter in the configuration file needs to be edited to switch techniques.
173
+
Note that the `chunking_algo` parameter in the configuration file needs to be edited to switch CDC techniques.
85
174
86
175
| Chunking Technique | chunking_algo |
87
176
|--------------------|---------------|
88
177
| AE | ae |
178
+
| CRC32 | crc |
89
179
| FastCDC | fastcdc |
90
180
| Gear Chunking | gear |
91
181
| Rabin's Chunking | rabins |
@@ -96,14 +186,16 @@ The following chunking techniques are currently supported by DedupBench. Note th
96
186
After choosing a `chunking_algo`, make sure to check and adjust its parameters (e.g. chunk sizes). _Note that each `chunking_algo` has a separate parameter section in the config file_. For example, SeqCDC's minimum and maximum chunk sizes are called `seq_min_block_size` and `seq_max_block_size` respectively.
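When running many configurations, it can be convenient to edit `config.txt` from a script instead of by hand. The sketch below is a hypothetical helper, not part of DedupBench. It assumes that `config.txt` consists of `key=value` lines with `#` comments, which should be checked against the actual file before use.

```python
def set_config_values(text, updates):
    """Return `text` with the given keys rewritten to new values.

    Assumes a `key=value` line format (an assumption, not confirmed by
    the README). Unknown keys raise KeyError instead of being appended,
    so a typo in a parameter name is not silently ignored.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        key, sep, _ = stripped.partition("=")
        key = key.strip()
        if sep and not stripped.startswith("#") and key in remaining:
            # Rewrite the line, keeping the key and replacing the value.
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    if remaining:
        raise KeyError(f"keys not found in config: {sorted(remaining)}")
    return "\n".join(out) + "\n"
```

For example, `set_config_values(text, {"chunking_algo": "fastcdc"})` would switch the CDC technique while leaving every other line, including each algorithm's own parameter section, untouched.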
97
187
98
188
### SSE / AVX Acceleration
99
-
To use VectorCDC's RAM (VRAM), set `chunking_algo` to point to RAM and change `simd_mode` to one of the following values:
189
+
To change the SIMD acceleration used, set `simd_mode` to one of the following values:
100
190
| SIMD Mode | simd_mode |
101
191
|-----------|-----------|
102
192
| SSE128 | sse128 |
103
193
| AVX256 | avx256 |
104
194
| AVX512 | avx512 |
195
+
| ARM NEON | neon128 |
196
+
| IBM VSX | altivec128 |
105
197
106
-
Note that only RAM currently supports SSE/AVX acceleration. dedup-bench must be compiled with AVX-512 support to use the `avx512` mode.
198
+
Note that only RAM, AE, and MAXP currently support SSE/AVX acceleration. dedup-bench must be compiled with AVX-512 support to use the `avx512` mode.
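If you are unsure which `simd_mode` your machine can run, the following sketch guesses one from the contents of `/proc/cpuinfo` on Linux. The helper names are hypothetical, and the mapping from CPU feature flags to modes is an assumption: in particular, treating `sse2` as sufficient for `sse128` and `avx2` as the requirement for `avx256` is a guess, since the exact instruction subsets needed are not stated here.

```python
def cpu_features(cpuinfo_text):
    """Collect CPU feature names from /proc/cpuinfo-style text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        if key in ("flags", "features"):
            # x86 lists features under "flags", ARM under "Features".
            features.update(value.split())
        elif key == "cpu" and "altivec supported" in value.lower():
            # POWER kernels report AltiVec on the "cpu" line instead.
            features.add("altivec")
    return features


def pick_simd_mode(features):
    """Map detected features to a simd_mode value, or None if none fit."""
    if {"avx512f", "avx512vl", "avx512bw"} <= features:
        return "avx512"
    if "avx2" in features:
        return "avx256"
    if "sse2" in features:
        return "sse128"
    if "asimd" in features or "neon" in features:
        return "neon128"
    if "altivec" in features:
        return "altivec128"
    return None
```

On a Linux machine this could be called as `pick_simd_mode(cpu_features(open("/proc/cpuinfo").read()))`. A `None` result suggests falling back to the basic unaccelerated build.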
107
199
108
200
### Hashing Techniques
109
201
The following hashing techniques are currently supported by DedupBench. Note that the `hashing_algo` parameter in the configuration file needs to be edited to switch techniques.
@@ -114,11 +206,12 @@ The following hashing techniques are currently supported by DedupBench. Note tha
114
206
| SHA1 | sha1 |
115
207
| SHA256 | sha256 |
116
208
| SHA512 | sha512 |
117
-
209
+
| MurmurHash3 (128-bit) | murmurhash3 |
210
+
| xxHash3 (128-bit) | xxhash128 |
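
To illustrate what the fingerprinting step contributes, here is a minimal Python sketch that fingerprints chunks and estimates space savings. It is not DedupBench's code. It covers only the SHA variants that Python's `hashlib` provides (MurmurHash3 and xxHash3 need third-party packages and are left out), and it assumes space savings is defined as the fraction of bytes removed by storing each unique chunk once.

```python
import hashlib

# hashing_algo values from the table that hashlib can compute directly.
HASHLIB_NAMES = {"sha1": "sha1", "sha256": "sha256", "sha512": "sha512"}


def fingerprint(chunk, hashing_algo="sha256"):
    """Return the hex digest used to identify a chunk."""
    return hashlib.new(HASHLIB_NAMES[hashing_algo], chunk).hexdigest()


def space_savings(chunks, hashing_algo="sha256"):
    """Fraction of bytes saved when duplicate chunks are stored once."""
    total = 0
    unique_sizes = {}
    for chunk in chunks:
        total += len(chunk)
        # Only the first chunk seen for each fingerprint is "stored".
        unique_sizes.setdefault(fingerprint(chunk, hashing_algo), len(chunk))
    if total == 0:
        return 0.0
    return 1.0 - sum(unique_sizes.values()) / total
```

With this definition, a dataset in which every chunk is distinct yields a space savings of zero, regardless of the chunking technique used.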
118
211
119
-
# VM Dataset from DedupBench 2023:
212
+
# Where is the VM Dataset used in the DedupBench 2023 paper?
120
213
121
-
The following images from Bitnami were used in the original DedupBench paper at CCECE 2023:
214
+
Note that this is **not the same** as DEB. The following images from Bitnami were used in the original DedupBench paper at CCECE 2023: