Commit 2d765dd

KarthikSubbarao, zackcam, and madolson authored
'Introducing Bloom Filters for Valkey' Blog (#229)
### Description

Adding a blog for the valkey-bloom module

---------

Signed-off-by: KarthikSubbarao <[email protected]>
Co-authored-by: zackcam <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
1 parent 4ee9881 commit 2d765dd

File tree

7 files changed

+191
-1
lines changed

content/authors/karthiksubbarao.md

+8
@@ -0,0 +1,8 @@
1+
---
2+
title: Karthik Subbarao
3+
extra:
4+
photo: '/assets/media/authors/karthiksubbarao.jpg'
5+
github: karthiksubbarao
6+
---
7+
8+
Karthik Subbarao is a Software Engineer at AWS who is passionate about distributed systems, databases, Rust, and, in general, innovating through software development / technology.
@@ -0,0 +1,159 @@
1+
+++
2+
title= "Introducing Bloom Filters for Valkey"
3+
description = "Learn how to use bloom filters to perform large-scale membership testing with significant memory savings."
4+
date= 2025-04-09 01:01:01
5+
authors= [ "karthiksubbarao"]
6+
+++
7+
8+
The Valkey project is introducing Bloom Filters as a new data type via [valkey-bloom](https://github.com/valkey-io/valkey-bloom/) (BSD-3 licensed), an official Valkey Module which is compatible with Valkey versions >= 8.0. Bloom filters provide efficient, large-scale membership testing, improving performance and offering significant memory savings for high-volume applications.
9+
10+
As an example, to handle advertisement deduplication workloads and answer the question, "Has this customer seen this ad before?", Valkey developers could use the SET data type.
11+
This is done by adding the IDs of customers who viewed an ad to a `SET` object representing that advertisement. The drawback of this approach is high memory usage, because every member of the set is stored in full.
12+
This article demonstrates how using the bloom filter data type from valkey-bloom can achieve significant memory savings, more than 93% in our example workload, while exploring its implementation, technical details, and practical recommendations.
13+
14+
## Introduction
15+
16+
A bloom filter is a space-efficient probabilistic data structure that supports adding elements and checking whether elements were previously added. False positives are possible: a filter can incorrectly indicate that an element exists even though it was never added.
17+
However, Bloom Filters guarantee that false negatives do not occur, meaning that an element that was added successfully will never be reported as not existing.
18+
19+
![Bloom Filter Bit Vector](/assets/media/pictures/bloomfilter_bitvector.png)
20+
*Image taken from [source](https://en.wikipedia.org/wiki/Bloom_filter#/media/File:Bloom_filter.svg)*
21+
22+
When adding an item to a bloom filter, K different hash functions map the item to K positions in the bit vector, and the bits at those positions are set to 1.
23+
Checking existence involves the same hash functions - if any bit is 0, the item is definitely absent; if all bits are 1, the item likely exists (with a defined false positive probability).
24+
This bit-based approach, which avoids storing the items themselves, makes bloom filters very space efficient; the trade-off is the possibility of false positives.
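To make the add and check mechanics concrete, here is a minimal Python sketch. It is not how valkey-bloom is implemented: the class name `TinyBloomFilter`, the salted SHA-256 hashing, and the parameter values are all assumptions chosen for illustration.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative bloom filter: a bit vector plus K hash-derived positions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Simulate K hash functions by salting one hash with the function index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set the bit at each of the K positions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Any 0 bit means "definitely absent"; all 1s means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = TinyBloomFilter(num_bits=1024, num_hashes=7)
bf.add("customer-123")
print(bf.might_contain("customer-123"))  # True: added items are never missed
```

A lookup for an item that was never added usually returns `False`, but it can return `True` when other items happen to have set all of its bits, which is the false positive case described above.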
25+
26+
Valkey-Bloom introduces bloom filters as a new data type to Valkey, providing both scalable and non-scalable variants.
27+
It is API compatible with the bloom filter command syntax of the official Valkey client libraries including valkey-py, valkey-java, valkey-go (as well as the equivalent Redis libraries).
28+
29+
## Data type overview
30+
31+
The "Bloom Object" is the main bloom data type structure. This is what gets created with any bloom filter creation command and this structure can act either as a "scaling bloom filter" or "non scaling bloom filter" depending on the user configuration.
32+
It consists of a vector of "Sub Filters": one or more in the scaling case, and exactly one in the non scaling case.
33+
34+
The "Sub Filter" is an inner structure which is created and used within the "Bloom Object". It tracks the capacity, number of items added, and an instance of a Bloom Filter (of the specified properties).
35+
36+
![bloom filter data type](/assets/media/pictures/bloomfilter_datatype.png)
37+
38+
**Non Scaling**
39+
40+
When non-scaling filters reach their capacity, if a user tries to add a new/unique item, an error is returned.
41+
You can create a non scaling bloom filter using `BF.RESERVE` or `BF.INSERT` commands.
42+
Example:
43+
```
BF.RESERVE <filter-name> <error-rate> <capacity> NONSCALING
```
46+
47+
**Scaling**
48+
49+
When scaling filters reach their capacity, if a user adds an item to the bloom filter, a new sub filter is created and added to the vector of sub filters.
50+
This new bloom sub filter will have a larger capacity (previous_bloomfilter_capacity * expansion_rate of the bloom filter).
51+
When checking whether an item exists on a scaled out bloom filter (`BF.EXISTS`/`BF.MEXISTS`), we look through each filter (from oldest to newest) in the sub filter vector and perform a check operation on each one.
52+
Similarly, to add a new item to the bloom filter, we check all the sub filters to see whether the item already exists; if it does not, the item is added to the current sub filter.
53+
A filter created implicitly with default settings by `BF.ADD`, `BF.MADD`, or `BF.INSERT` is a scalable bloom filter.
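The scale out behaviour described above can be sketched in a few lines of Python. This is only a model of the logic, not the module's code: the class names are hypothetical, and a Python `set` stands in for each sub filter's bit vector so that the focus stays on capacity tracking and sub filter creation.

```python
class SubFilter:
    """Hypothetical sub filter: tracks its capacity and the items added to it."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = set()  # stand-in for a real bloom filter bit vector


class ScalingFilterSketch:
    """Hypothetical scaling filter: a vector of sub filters that grows on demand."""

    def __init__(self, capacity: int, expansion: int) -> None:
        self.expansion = expansion
        self.sub_filters = [SubFilter(capacity)]

    def exists(self, item: str) -> bool:
        # Check every sub filter, from oldest to newest.
        return any(item in sf.items for sf in self.sub_filters)

    def add(self, item: str) -> bool:
        if self.exists(item):
            return False  # already present, nothing is added
        current = self.sub_filters[-1]
        if len(current.items) >= current.capacity:
            # Scale out: the new sub filter is `expansion` times larger.
            current = SubFilter(current.capacity * self.expansion)
            self.sub_filters.append(current)
        current.items.add(item)
        return True


f = ScalingFilterSketch(capacity=2, expansion=2)
for i in range(7):
    f.add(f"user{i}")
print([sf.capacity for sf in f.sub_filters])  # [2, 4, 8]
```

Seven unique items with a starting capacity of 2 and an expansion rate of 2 trigger two scale outs, leaving three sub filters to check on every later add or exists operation.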
54+
55+
**Common Bloom filter properties**
56+
57+
Capacity - The number of unique items that can be added before a scale out occurs (in case of scalable bloom filters) or before any command which inserts a new item returns an error (in case of non scalable bloom filters).
58+
59+
False Positive Rate (FP) - The rate that controls the probability of item add/exists operations being false positives. Example: 0.001 means 1 in every 1000 operations can be a false positive.
60+
61+
## Use cases / Memory Savings
62+
63+
In this example, we are simulating a very common use case of bloom filters: Advertisement Deduplication. Applications can utilize bloom filters to track whether an advertisement / promotion has already been shown to a customer and use this to prevent showing it again to the customer.
64+
65+
Let us assume we have 500 unique advertisements and our service has 5M customers. Both advertisements and customers are identified by a UUID (36 characters).
66+
67+
Without bloom filters, applications could use the `SET` Valkey data type such that they have a unique `SET` for every advertisement.
68+
Then, they can use the `SADD` command to track every customer who has already seen this particular advertisement by adding them to the set.
69+
To check if a customer has seen the ad, the `SISMEMBER` or `SMISMEMBER` command can be used. This means we have 500 sets, each with 5M members. This will require ~152.57 GB of `used_memory` on a Valkey 8.0 server.
70+
71+
With bloom filters, applications can create a unique bloom filter for every advertisement with the `BF.RESERVE` or `BF.INSERT` command.
72+
Here, they can specify the exact capacity they require: 5M, which means 5M items can be added to the bloom filter. For every customer that the advertisement is shown to, the application adds the customer's UUID to that advertisement's filter.
73+
To check if a customer has seen the ad, the `BF.EXISTS` or `BF.MEXISTS` command can be used. So, we have 500 bloom filters, each with a capacity of 5M.
74+
The memory required varies with the false positive rate. In all cases, even with stricter false positive rates, there is a significant memory saving compared to using the `SET` data type.
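As a rough sketch of this workflow from application code, the helpers below build the bloom commands mentioned above. They assume a client object exposing an `execute_command(*args)` method, as redis-py-style clients such as valkey-py do; the key naming scheme, the helper names, and the `RecordingClient` stub (which only records commands so the example runs without a server) are hypothetical.

```python
class RecordingClient:
    """Stand-in for a real client: records commands instead of sending them."""

    def __init__(self) -> None:
        self.sent = []

    def execute_command(self, *args):
        self.sent.append(args)
        return 1  # a real server would reply based on the filter's state


def ad_filter_key(ad_id: str) -> str:
    # Hypothetical key naming: one bloom filter per advertisement.
    return f"ad:{ad_id}:seen"


def create_ad_filter(client, ad_id, error_rate=0.001, capacity=5_000_000):
    # Non scaling filter sized for the expected number of customers.
    return client.execute_command(
        "BF.RESERVE", ad_filter_key(ad_id), error_rate, capacity, "NONSCALING"
    )


def mark_seen(client, ad_id, customer_id):
    # Record that this customer has been shown the advertisement.
    return client.execute_command("BF.ADD", ad_filter_key(ad_id), customer_id)


def has_seen(client, ad_id, customer_id) -> bool:
    # True means "probably seen"; False means "definitely not seen".
    return bool(
        client.execute_command("BF.EXISTS", ad_filter_key(ad_id), customer_id)
    )


client = RecordingClient()
create_ad_filter(client, "ad-1")
mark_seen(client, "ad-1", "customer-uuid")
print(client.sent)
```

Against a real server, the same helpers would be called with an actual client instance in place of the stub.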
75+
76+
| Number of Bloom Filters | Capacity | FP Rate | FP Rate Description | Total Used Memory (GB) | Memory Saved % compared to SETS |
77+
|-------------------------|----------|---------|----------------------|------------------------|----------------------------------|
78+
| 500 | 5000000 | 0.01 | One in every 100 | 2.9 | **98.08%** |
79+
| 500 | 5000000 | 0.001 | One in every 1K | 4.9 | **96.80%** |
80+
| 500 | 5000000 | 0.00001 | One in every 100K | 7.8 | **94.88%** |
81+
| 500 | 5000000 | 0.0000002 | One in every 5M | 9.8 | **93.60%** |
82+
83+
In this example, we are able to benefit from 93% - 98% savings in memory usage when using Bloom Filters compared to the `SET` data type. Depending on your workload, you can expect similar results.
84+
85+
![SET vs Bloom Filter Memory Usage Comparison](/assets/media/pictures/bloomfilter_memusage.png)
86+
87+
## Large Bloom Filters and Recommendations
88+
89+
To improve server performance during serialization and deserialization of bloom filters, we have added validation on the memory usage per object.
90+
The memory usage limit of a bloom filter is set by the `BF.BLOOM-MEMORY-USAGE-LIMIT` configuration, which defaults to 128 MB.
91+
This limit can be tuned by changing that configuration.
92+
93+
As a consequence, any operation that creates a bloom filter or scales one out returns an error if the resulting filter's overall memory usage would exceed the limit. Example:
94+
```
127.0.0.1:6379> BF.ADD ad1_filter user1
(error) ERR operation exceeds bloom object memory limit
```
98+
This is a problem for users whose scalable bloom filters reach the memory limit after some number of days of data population: from that point on, scale outs fail when unique items are inserted.
99+
100+
As a solution, to help users understand at what capacity their bloom filter will hit the memory limit, valkey-bloom has two options.
101+
These are useful to check beforehand to ensure that your bloom filter will not fail scale outs or creations later on as part of your workload.
102+
103+
1. Perform a memory check prior to bloom filter creation
104+
105+
We can use the `VALIDATESCALETO` option of the `BF.INSERT` command to validate whether the filter stays within the memory limit when scaled out to the given capacity.
106+
If it is not within the limits, the command will return an error. In the example below, we see that filter1 cannot scale out and reach the capacity of 26214301 due to the memory limit. However, it can scale out and reach a capacity of 26214300.
107+
```
127.0.0.1:6379> BF.INSERT filter1 VALIDATESCALETO 26214301
(error) ERR provided VALIDATESCALETO causes bloom object to exceed memory limit
127.0.0.1:6379> BF.INSERT filter1 VALIDATESCALETO 26214300 ITEMS item1
1) (integer) 1
```
113+
2. Check the maximum capacity that an existing scalable bloom filter can expand to
114+
115+
We can use the `BF.INFO` command to find out the maximum capacity that the scalable bloom filter can expand to hold. In this case, we can see the filter can hold 26214300 items (after scaling out until the memory limit).
116+
```
127.0.0.1:6379> BF.INFO filter1 MAXSCALEDCAPACITY
(integer) 26214300
```
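One way to read the figure 26214300 is as the sum of a geometric series of sub filter capacities. The sketch below assumes a starting capacity of 100 and an expansion rate of 2 for `filter1`; the post does not state these values, so treat them as assumptions rather than as the module's documented behaviour.

```python
def scaled_capacities(start_capacity: int, expansion: int, sub_filters: int):
    """Capacity of each sub filter after repeated scale outs."""
    return [start_capacity * expansion**i for i in range(sub_filters)]


# Assumed values: starting capacity 100, expansion rate 2, 18 sub filters.
capacities = scaled_capacities(start_capacity=100, expansion=2, sub_filters=18)
print(sum(capacities))  # 26214300, i.e. 100 * (2**18 - 1)
```

Under these assumptions, 18 sub filters add up to exactly the reported maximum, and a 19th scale out would be the one that pushes the object past the memory limit.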
120+
121+
The table below shows the maximum capacity of an individual non scaling filter for different false positive rates and memory limits.
122+
With a 128MB limit and default false positive rate, we can create a bloom filter with 112M as the capacity. With a 512MB limit, a bloom filter can hold 448M items.
123+
124+
| Non Scaling Filter - Capacity | FP Rate | Memory Usage (MB) | Notes |
125+
|------------------------------|---------|-------------------|------------------------------------------|
126+
| 112M | 0.01 | ~128 | Default FP Rate and Default Memory Limit |
127+
| 74M | 0.001 | ~128 | Custom FP Rate and Default Memory Limit |
128+
| 448M | 0.01 | ~512 | Default FP Rate and Custom Memory Limit |
129+
| 298M | 0.001 | ~512 | Custom FP Rate and Custom Memory Limit |
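The capacities in the table can be approximated with the textbook sizing formula for an optimally configured bloom filter, which needs about -ln(p) / (ln 2)^2 bits per item for a false positive rate p. The sketch below applies that formula; it assumes 1 MB means 1024 * 1024 bytes and ignores any per-object overhead, so it is an estimate rather than the module's actual accounting.

```python
import math

MB = 1024 * 1024  # assumption: the limit is expressed in binary megabytes


def bits_per_item(fp_rate: float) -> float:
    # Textbook optimum for a bloom filter with false positive rate `fp_rate`.
    return -math.log(fp_rate) / (math.log(2) ** 2)


def max_capacity(memory_limit_bytes: int, fp_rate: float) -> int:
    # Number of items whose bits fit within the memory limit.
    return int(memory_limit_bytes * 8 / bits_per_item(fp_rate))


for limit_mb, fp_rate in [(128, 0.01), (128, 0.001), (512, 0.01), (512, 0.001)]:
    capacity = max_capacity(limit_mb * MB, fp_rate)
    print(f"{limit_mb} MB, FP {fp_rate}: about {capacity // 1_000_000}M items")
```

Under these assumptions the estimates land on the same whole-million figures as the table, which suggests the table follows the standard sizing relationship between capacity, false positive rate, and memory.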
130+
131+
## Performance
132+
133+
The bloom commands that add items or check for the existence of items have a time complexity of O(N * K), where N is the number of items in the command and K is the number of hash functions used by the bloom filter.
134+
This means that `BF.ADD` and `BF.EXISTS` are both O(K) as they only operate on one item.
135+
136+
In scalable bloom filters, the number of hash function based checks during add/exists operations increases with each scale out. Each sub filter requires at least one hash function, and this number increases as the false positive rate becomes stricter with scale outs due to the [tightening ratio](https://valkey.io/topics/bloomfilters/#advanced-properties).
137+
For this reason, it is recommended that users choose a capacity and expansion rate after evaluating the use case / workload to avoid several scale outs and reduce the number of checks.
138+
139+
Example: For a bloom filter to achieve an overall capacity of 10M with a starting capacity of 100K and expansion rate of 1, it will require 100 sub filters (after 99 scale outs).
140+
Instead, with the same starting capacity of 100K and expansion rate of 2, a bloom filter can achieve an overall capacity of ~12.7M with just 7 sub filters.
141+
Alternatively, with the same expansion rate of 1 and starting capacity of 1M, a bloom filter can achieve an overall capacity of 10M with 10 sub filters.
142+
Both approaches significantly reduce the number of checks per item add / exists operation.
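The sub filter counts in this example can be reproduced with a short calculation. The helper below is a hypothetical illustration: it keeps adding sub filters, each `expansion` times larger than the previous one, until the combined capacity reaches the target.

```python
def sub_filters_needed(start_capacity: int, expansion: int, target_capacity: int):
    """Return (number of sub filters, overall capacity) needed to reach the target."""
    count, total, capacity = 0, 0, start_capacity
    while total < target_capacity:
        total += capacity       # this sub filter's contribution
        capacity *= expansion   # the next sub filter is larger
        count += 1
    return count, total


print(sub_filters_needed(100_000, 1, 10_000_000))    # (100, 10000000)
print(sub_filters_needed(100_000, 2, 10_000_000))    # (7, 12700000)
print(sub_filters_needed(1_000_000, 1, 10_000_000))  # (10, 10000000)
```

Running the same helper with your own starting capacity and expansion rate gives a quick estimate of how many sub filters each add or exists operation will have to check.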
143+
144+
The other bloom filter commands are O(1) time complexity: `BF.CARD`, `BF.INFO`, `BF.RESERVE`, and `BF.INSERT` (when no items are provided).
145+
146+
## Conclusion
147+
148+
valkey-bloom offers an efficient solution for high-volume membership testing through bloom filters, providing significant memory usage savings compared to traditional data types.
149+
This enhances Valkey's capability to handle various workloads including large-scale advertisement / event deduplication, fraud detection, and reducing disk / backend lookups more efficiently.
150+
151+
To learn more about [valkey-bloom](https://github.com/valkey-io/valkey-bloom/), you can read about the data type [here](https://valkey.io/topics/bloomfilters/) and follow the [quick start guide](https://github.com/valkey-io/valkey-bloom/blob/1.0.0/QUICK_START.md) to try it yourself.
152+
Additionally, to use valkey-bloom on Docker (along with other official modules), you can check out the [Valkey Extensions Docker Image](https://hub.docker.com/r/valkey/valkey-extension).
153+
154+
Thank you to all those who helped develop the module:
155+
* Karthik Subbarao ([KarthikSubbarao](https://github.com/KarthikSubbarao))
156+
* Cameron Zack ([zackcam](https://github.com/zackcam))
157+
* Vanessa Tang ([YueTang-Vanessa](https://github.com/YueTang-Vanessa))
158+
* Nihal Mehta ([nnmehta](https://github.com/nnmehta))
159+
* wuranxx ([wuranxx](https://github.com/wuranxx))

sass/css/styles.scss

+24-1
@@ -6,4 +6,27 @@
66
@import '../colors';
77
@import '../typography';
88
@import '../pygments';
9-
@import '../valkey';
9+
@import '../valkey';
10+
11+
/* Styling for Markdown Tables */
12+
table {
13+
width: 100%;
14+
border-collapse: collapse;
15+
margin: 1em 0;
16+
}
17+
18+
th {
19+
background-color: #123678;
20+
color: white;
21+
padding: 8px;
22+
text-align: left;
23+
}
24+
25+
td {
26+
border: 1px solid #ddd;
27+
padding: 8px;
28+
}
29+
30+
tr:nth-child(even) {
31+
background-color: #f2f2f2;
32+
}