Commit 19c24cc

Author: s2udayas
Commit message: Updating README and cleaning up
Parent: 439e9b3

20 files changed: +173 −72 lines

.gitignore (3 additions, 1 deletion)

```diff
@@ -39,4 +39,6 @@
 results.txt
 results_*.txt
-results_graph.png
+results_graph.png
+build/random_dataset/
+build/hashes_*
```
README.md (149 additions, 56 deletions) — resulting file:
![DedupBench Logo](images/dedupbench_logo.png)

<h2><p align="center">Benchmarking Chunking Techniques for Data Deduplication</p></h2>

<h3><p align="center">
<a href="#-quick-start-guide"> 🚀 Quick Start</a> | <a href="#news">⭐News</a> | <a href="#-research-papers"> 🔖 Research Papers </a> | <a href="https://www.kaggle.com/datasets/sreeharshau/vm-deb-fast25"> 💾 VM Images Dataset </a> | <a href="#faq">❓FAQ </a> | <a href="#-people"> 💂‍♂️ People </a>
</p></h3>
# 🎉 Introduction

DedupBench is a benchmarking tool for data chunking techniques used in data deduplication. It is designed for extensibility, allowing new chunking and fingerprinting techniques to be implemented with minimal additional code. DedupBench works with any dataset, allowing for the quick comparison of a large number of chunking techniques on user-specified data.

It currently supports eleven chunking algorithms and six fingerprinting algorithms, with SIMD acceleration for several of these algorithms across five vector instruction sets on Intel, AMD, ARM, and IBM CPUs.

The following chunking techniques and SIMD accelerations are currently supported by DedupBench:
| CDC Algorithm | Link | Unaccelerated | SSE-128 | AVX-256 | AVX-512 | NEON-128 (ARM) | VSX-128 (IBM) |
| :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| AE-Max | [Paper](https://ieeexplore.ieee.org/document/7218510) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| AE-Min | [Paper](https://ieeexplore.ieee.org/document/7218510) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| CRC-32 | [GitHub](https://github.com/google/crc32c) | ✔️ | | | | | |
| FastCDC | [Paper](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia) | ✔️ | | | | | |
| Fixed-size | [Paper](https://www.usenix.org/conference/fast-02/venti-new-approach-archival-data-storage) | ✔️ | | | | | |
| Gear | [Paper](https://www.sciencedirect.com/science/article/pii/S0166531614000790) | ✔️ | | | | | |
| MAXP | [Paper](https://www.sciencedirect.com/science/article/pii/S0022000009000580) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Rabin | [Paper](https://dl.acm.org/doi/abs/10.1145/502034.502052) | ✔️ | | | | | |
| RAM | [Paper](https://www.sciencedirect.com/science/article/abs/pii/S0167739X16305829) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| SeqCDC | [Paper](https://dl.acm.org/doi/10.1145/3652892.3700766) | ✔️ | | | | | |
| TTTD | [Paper](https://shiftleft.com/mirrors/www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf) | ✔️ | | | | | |
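Most of the algorithms above are content-defined chunkers (CDC): they roll a hash over the byte stream and declare a chunk boundary whenever the hash satisfies a condition, so boundaries survive insertions and deletions. The sketch below is a minimal, illustrative Gear-style chunker in Python; the gear table, mask width, and size limits are arbitrary example choices, not DedupBench's implementation.

```python
import random

# A 256-entry table of random 64-bit values (the "gear" table); fixed seed
# keeps the example deterministic. The mask width sets the average chunk size.
random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 13) - 1  # 13-bit mask: a boundary roughly every 8 KiB scanned

def gear_chunks(data, min_size=2048, max_size=65536):
    """Split `data` into content-defined chunks using a Gear-style rolling hash."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # tail chunk; may be shorter than min_size
    return chunks
```

Because the boundary decision depends only on recently hashed bytes, inserting data near the start of a file shifts only nearby boundaries; fixed-size chunking, by contrast, shifts every boundary after the edit.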
# ⭐News
- *Aug. 2025*: We have released DedupBench v2.0 with ARM / IBM vector acceleration support, xxHash compatibility, and much more!
- *Feb. 2025*: VectorCDC has been published at [FAST 2025](https://www.usenix.org/conference/fast25/presentation/udayashankar)!
- *Jan. 2025*: We have released the [DEB dataset](https://www.kaggle.com/datasets/sreeharshau/vm-deb-fast25) on Kaggle.
# 🚀 Quick start guide
To get started quickly, run the following commands on Ubuntu:

1. Clone the repository and create a basic build without SIMD acceleration.
```
git clone https://github.com/UWASL/dedup-bench.git
cd dedup-bench/
sh ./install.sh
```

2. Run a preconfigured benchmark with an 8KB average chunk size and the unaccelerated algorithms.
```
cd build/
./dedup_script.sh -c unaccelerated_8kb random_dataset
python3 plot_results.py results.txt
```
This should generate a graph titled _results_graph.png_, similar to the one below. Note that the space savings will be zero for all algorithms, as the default run uses a random dataset.

To see a real dataset in action and generate the graph below, download the _DEB_ dataset used in our Middleware 2024 / FAST 2025 papers from the [💾 VM Images Dataset](https://www.kaggle.com/datasets/sreeharshau/vm-deb-fast25). The graph below is from an AMD EPYC Rome machine.

![DEB-dataset results](images/sample_results_graph.png)
# ⚡ DedupBench SIMD Builds

To use any of the vector-accelerated CDC algorithms, an alternative DedupBench build is required. We have provided preconfigured files for all algorithms with 8KB chunk sizes for convenience.

**_Note that building with the wrong options (such as AVX-256 on a machine without AVX-256 support) may result in compile-time or runtime errors._**

### 🔥 SSE/AVX-256 Acceleration
This build needs an AVX-256 compatible CPU to work correctly.
```
cd build/
make clean
make simd_all

./dedup_script.sh -c simd_8kb random_dataset
python3 plot_results.py results.txt
```
### 🌀 AVX-512 Acceleration
This build needs an AVX-512 compatible CPU to work correctly.
```
cd build/
make clean
make simd512_all

./dedup_script.sh -c simd512_8kb random_dataset
python3 plot_results.py results.txt
```
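Before picking one of these SIMD builds, it can help to confirm which vector extensions the CPU actually advertises. The snippet below is a convenience sketch for Linux (it simply scans `/proc/cpuinfo`), not part of DedupBench:

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature-flag set on Linux, or an empty set elsewhere."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

# AVX-512 builds need the avx512f/avx512vl/avx512bw features.
flags = cpu_flags()
for feature in ("sse2", "avx2", "avx512f", "avx512vl", "avx512bw"):
    print(feature, "yes" if feature in flags else "no")
```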

## 🚴 Basic Unaccelerated Build
This unaccelerated build should work on all machines, regardless of CPU capabilities.
```
cd build/
make clean
make

./dedup_script.sh -c unaccelerated_8kb random_dataset
python3 plot_results.py results.txt
```
## 🔨 Alternate builds for ARM / IBM
Note that we have not provided configuration file examples for these builds. Please refer to the custom runs section in the [❓FAQ](#faq).
#### ARM with NEON-128 instructions
```
cd build/
make clean
make arm_neon128
```
#### IBM with VSX-128 / AltiVec instructions
```
cd build/
make clean
make ibm_altivec128
```
# 🔖 Research Papers

Please cite the relevant publications from this list if you use the code from this repository:

### Vectorized algorithms / DEB dataset
```
[1] Udayashankar, S., Baba, A. and Al-Kiswany, S., 2025, February. VectorCDC: Accelerating Data Deduplication with SSE/AVX Instructions. In 2025 USENIX 23rd Conference on File and Storage Technologies (FAST). USENIX.
```
### SeqCDC
```
[2] Udayashankar, S., Baba, A. and Al-Kiswany, S., 2024, December. SeqCDC: Hashless Content-Defined Chunking for Data Deduplication. In 2024 ACM/IFIP 25th International Middleware Conference (MIDDLEWARE). ACM.
```
### Low Entropy Analysis
```
[3] Jarah, MA., Udayashankar, S., Baba, A. and Al-Kiswany, S., 2024, July. The Impact of Low-Entropy on Chunking Techniques for Data Deduplication. In 2024 IEEE 17th International Conference on Cloud Computing (CLOUD) (pp. 134-140). IEEE.
```
### DedupBench Original Paper
```
[4] Liu, A., Baba, A., Udayashankar, S. and Al-Kiswany, S., 2023, September. DedupBench: A Benchmarking Tool for Data Chunking Techniques. In 2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) (pp. 469-474). IEEE.
```
# 💂‍♂️ People

For additional information, contact us via email:
- Sreeharsha Udayashankar: [email protected]
- Abdelrahman Baba: [email protected]
- Mu'men Al-Jarah: [email protected]
- Samer Al-Kiswany: [email protected]
# ❓FAQ

## How do I run the experiments from your papers?
We provide configuration files in `build/` for the experiments from our papers. Run `dedup_script.sh` with the matching configuration:
```
./dedup_script.sh -c 8kb_fast25 <path-to-dataset>
python3 plot_results.py results.txt
```

## How do I run custom techniques / chunk sizes?
1. Choose the required chunking technique, hashing technique, and chunk sizes by modifying `config.txt`.
```
cd build/
vim config.txt
```
2. Run the dedup-bench binary directly. Note that the path passed must be a **directory containing all the dataset files** and that the output is written to a file named `hash.out`. Throughput and average chunk size are printed to stdout.
```
./dedup.exe <path_to_random_dataset_dir> config.txt
```
3. Measure the deduplication ratio from the generated fingerprints.
```
./measure-dedup.exe hash.out
```
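`measure-dedup.exe` reports space savings based on the chunk fingerprints in `hash.out`. Conceptually, the savings are one minus the fraction of bytes remaining after duplicate chunks are collapsed; the hypothetical Python equivalent below illustrates the idea (it is not the tool's actual code):

```python
import hashlib

def space_savings(chunks):
    """Fraction of total bytes saved by storing each unique chunk once."""
    total = sum(len(c) for c in chunks)
    seen = set()
    unique_bytes = 0
    for c in chunks:
        fp = hashlib.sha256(c).digest()  # chunk fingerprint
        if fp not in seen:
            seen.add(fp)
            unique_bytes += len(c)
    return 1.0 - unique_bytes / total

# Two of the four 100-byte chunks are duplicates: 25% savings.
print(space_savings([b"a" * 100, b"b" * 100, b"a" * 100, b"c" * 100]))  # → 0.25
```

This is why the random quick-start dataset shows zero savings: every chunk of random data is unique.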

## How do I modify config.txt for custom runs?

### Chunking techniques (CDC algorithms)

Note that the `chunking_algo` parameter in the configuration file needs to be edited to switch CDC techniques.

| Chunking Technique | chunking_algo |
|--------------------|---------------|
| AE | ae |
| CRC32 | crc |
| FastCDC | fastcdc |
| Gear Chunking | gear |
| Rabin's Chunking | rabins |
| … | … |

After choosing a `chunking_algo`, make sure to check and adjust its parameters (e.g. chunk sizes). _Note that each `chunking_algo` has a separate parameter section in the config file_. For example, SeqCDC's minimum and maximum chunk sizes are called `seq_min_block_size` and `seq_max_block_size` respectively.
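Putting the pieces together, a `config.txt` for a custom run might look like the sketch below. Treat it as illustrative only: the authoritative layout is in the sample configuration files shipped in `build/`, and only the parameter names (`chunking_algo`, `hashing_algo`, `simd_mode`, `seq_min_block_size`, `seq_max_block_size`) come from this README; the values and comments are hypothetical.

```
# Illustrative config.txt sketch -- copy a sample file from build/
# (e.g. the 8kb_fast25 configs) for the real format.
chunking_algo = ae            # see the chunking table above for valid tokens
hashing_algo = sha256         # see the hashing table below for valid tokens
simd_mode = avx256            # only RAM, AE, and MAXP support SIMD modes

# Each chunking_algo has its own parameter section; for SeqCDC it would be:
# seq_min_block_size = 4096
# seq_max_block_size = 16384
```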
### SSE / AVX Acceleration
To change the SIMD acceleration used, set `simd_mode` to one of the following values:

| SIMD Mode | simd_mode |
|-----------|-----------|
| SSE128 | sse128 |
| AVX256 | avx256 |
| AVX512 | avx512 |
| ARM NEON | neon128 |
| IBM VSX | altivec128 |

Note that only RAM, AE, and MAXP currently support SSE/AVX acceleration. dedup-bench must be compiled with AVX-512 support to use the `avx512` mode.

### Hashing Techniques
The following hashing techniques are currently supported by DedupBench. Note that the `hashing_algo` parameter in the configuration file needs to be edited to switch techniques.

| Hashing Technique | hashing_algo |
|-------------------|--------------|
| … | … |
| SHA1 | sha1 |
| SHA256 | sha256 |
| SHA512 | sha512 |
| MurmurHash3 (128-bit) | murmurhash3 |
| xxHash3 (128-bit) | xxhash128 |
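Fingerprint choice trades collision resistance against speed, which is why non-cryptographic options such as MurmurHash3 and xxHash3 appear alongside the SHA family. The illustrative snippet below shows the cryptographic options via Python's standard `hashlib` (MurmurHash3 and xxHash need third-party packages, so they are omitted here):

```python
import hashlib

chunk = b"\x5a" * 8192  # one 8 KiB chunk

# Fingerprint one chunk with each supported cryptographic hash.
for algo in ("sha1", "sha256", "sha512"):
    digest = hashlib.new(algo, chunk).hexdigest()
    print(f"{algo}: {len(digest) // 2}-byte fingerprint {digest[:16]}...")
```

Larger digests lower the collision probability but increase the size of the fingerprint index.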

# Where is the VM Dataset used in the DedupBench 2023 paper?

Note that this is **not the same** as DEB. The following images from Bitnami were used in the original DedupBench paper at CCECE 2023:

### Image URLs
```
…
```
7 files renamed without changes.
