Skip to content

Commit 69db9f8

Browse files
committed
bloomfilter-blocked: updated documentation
This includes * A new README * Moved "examples" and "differences" haddock sections to the top-level module. * Make examples executable using `cabal-docspec` * Add `BloomPolicy` and `BloomSize` types for the blocked bloom filter instead of reusing the types of the same name from the classic bloom filter. This adds a bit of boilerplate, but it makes the documentation clearer because the hyperlinks were pointing from the blocked modules to the classic modules before. * Add `read` and `readHashes` to the blocked bloom filter, which the classic bloom filter already had implemented.
1 parent 6641ff1 commit 69db9f8

File tree

11 files changed

+360
-161
lines changed

11 files changed

+360
-161
lines changed

bloomfilter-blocked/README.md

Lines changed: 62 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,29 +1,75 @@
1-
# A fast, space efficient Bloom filter implementation
1+
# bloomfilter-blocked
22

3-
Copyright 2008, 2009, 2010, 2011 Bryan O'Sullivan <[email protected]>.
3+
`bloomfilter-blocked` is a Haskell library providing multiple fast and efficient
4+
implementations of [bloom filters][bloom-filter:wiki]. It is a full rewrite of
5+
the [`bloomfilter`][bloomfilter:hackage] package, originally authored by Bryan
6+
O'Sullivan <[email protected]>.
47

5-
This package provides both mutable and immutable Bloom filter data
6-
types, along with a family of hash function and an easy-to-use
7-
interface.
8+
A bloom filter is a space-efficient data structure representing a set that can
9+
be probablistically queried for set membership. The set membership query returns
10+
no false negatives, but it might return false positives. That is, if an element
11+
was added to a bloom filter, then a subsequent query definitely returns `True`.
12+
If an element was *not* added to a filter, then a subsequent query may still
13+
return `True` if `False` would be the correct answer. The probabiliy of false
14+
positives -- the false positive rate (FPR) -- is configurable, as we will
15+
describe later.
816

9-
To build:
17+
The library includes two implementations of bloom filters: classic, and blocked.
1018

11-
cabal install bloomfilter
19+
* **Classic** bloom filters, found in the `Data.BloomFilter.Classic` module: a
20+
default implementation that is faithful to the canonical description of a
21+
bloom filter data structure.
1222

13-
For examples of usage, see the Haddock documentation and the files in
14-
the examples directory.
23+
* **Blocked** floom filters, found in the `Data.BloomFilter.Blocked` module: an
24+
implementation that optimises the memory layout of a classic bloom filter for
25+
speed (cheaper CPU cache reads), at the cost of a slightly higher FPR for the
26+
same amount of assigned memory.
1527

28+
The FPR scales inversely with how much memory is assigned to the filter. It also
29+
scales inversely with how many elements are added to the set. The user can
30+
configure how much memory is asisgned to a filter, and the user also controls
31+
how many elements are added to a set. Each implementation comes with helper
32+
functions, like `sizeForFPR` and `sizeForBits`, that the user can leverage to
33+
configure filters.
1634

17-
# Get involved!
35+
Both immutable (`Bloom`) and mutable (`MBloom`) bloom filters, including
36+
functions to convert between the two, are provided for each implementation. Note
37+
however that a (mutable) bloom filter can not be resized once created, and that
38+
elements can not be deleted once inserted.
1839

19-
Please report bugs via the
20-
[github issue tracker](https://github.com/haskell-pkg-janitors/bloomfilter).
40+
For more information about the library and examples of how to use it, see the
41+
Haddock documentation of the different modules.
2142

22-
Master [git repository](https://github.com/haskell-pkg-janitors/bloomfilter):
43+
# Usage notes
2344

24-
* `git clone git://github.com/haskell-pkg-janitors/bloomfilter.git`
45+
User should take into account the following:
2546

47+
* This package is not supported on 32bit systems.
2648

27-
# Authors
49+
# Differences from the `bloomfilter` package
2850

29-
This library is written by Bryan O'Sullivan, <[email protected]>.
51+
The library is a full rewrite of the [`bloomfilter`][bloomfilter:hackage]
52+
package, originally authored by Bryan O'Sullivan <[email protected]>. The main
53+
differences are:
54+
55+
* `bloomfilter-blocked` supports both classic and blocked bloom filters, whereas
56+
`bloomfilter` only supports the former.
57+
* `bloomfilter-blocked` supports bloom filters of arbitrary sizes, whereas
58+
`bloomfilter` limits the sizes to powers of two.
59+
* `bloomfilter-blocked` supports sizes up to `2^48` for classic bloom filters
60+
and up to `2^41` for blocked bloom filters, instead of `2^32`.
61+
* In `bloomfilter-blocked`, the `Bloom` and `MBloom` types are parameterised
62+
over a `Hashable` type class, instead of having a `a -> [Hash]` typed field.
63+
This separation in `bloomfilter-blocked` allows clean (de-)serialisation of
64+
filters as the hashing scheme is static.
65+
* `bloomfilter-blocked` uses [`XXH3`][xxh3] for hashing instead of [Jenkins'
66+
`lookup3`][lookup3:wiki], which `bloomfilter` uses.
67+
* The user can configure hash salts for improved security in
68+
`bloomfilter-blocked`, whereas this is not supported in `bloomfilter`.
69+
70+
<!-- Sources -->
71+
72+
[bloom-filter:wiki]: https://en.wikipedia.org/wiki/Bloom_filter
73+
[bloomfilter:hackage]: https://hackage.haskell.org/package/bloomfilter
74+
[xxh3]: https://xxhash.com/
75+
[lookup3:wiki]: https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3
Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,63 @@
1+
-- | By default, this module re-exports the classic bloom filter implementation
2+
-- from "Data.BloomFilter.Classic". If you want to use the blocked bloom filter
3+
-- implementation, import "Data.BloomFilter.Blocked".
14
module Data.BloomFilter (
25
module Data.BloomFilter.Classic
6+
-- * Example: a spelling checker
7+
-- $example
8+
9+
-- * Differences with the @bloomfilter@ package
10+
-- $differences
311
) where
412

513
import Data.BloomFilter.Classic
14+
15+
-- $example
16+
--
17+
-- This example reads a dictionary file containing one word per line,
18+
-- constructs a Bloom filter with a 1% false positive rate, and
19+
-- spellchecks its standard input. Like the Unix @spell@ command, it
20+
-- prints each word that it does not recognize.
21+
--
22+
-- >>> import Control.Monad (forM_)
23+
-- >>> import System.Environment (getArgs)
24+
-- >>> import qualified Data.BloomFilter as B
25+
--
26+
-- >>> :{
27+
-- main :: IO ()
28+
-- main = do
29+
-- files <- getArgs
30+
-- dictionary <- readFile "/usr/share/dict/words"
31+
-- let !bloom = B.fromList (B.policyForFPR 0.01) 4 (words dictionary)
32+
-- forM_ files $ \file ->
33+
-- putStrLn . unlines . filter (`B.notElem` bloom) . words
34+
-- =<< readFile file
35+
-- :}
36+
37+
-- $differences
38+
--
39+
-- This package is an entirely rewritten fork of the
40+
-- [bloomfilter](https://hackage.haskell.org/package/bloomfilter) package.
41+
--
42+
-- The main differences are
43+
--
44+
-- * Support for both classic and \"blocked\" Bloom filters. Blocked-structured
45+
-- Bloom filters arrange all the bits for each insert or lookup into a single
46+
-- cache line, which greatly reduces the number of slow uncached memory reads.
47+
-- The trade-off for this performance optimisation is a slightly worse
48+
-- trade-off between bits per element and the FPR. In practice for typical
49+
-- FPRs of @1-e3@ up to @1e-4@, this requires a couple extra bits per element.
50+
--
51+
-- * This package support Bloom filters of arbitrary sizes (not limited to powers
52+
-- of two).
53+
--
54+
-- * Sizes over @2^32@ are supported up to @2^48@ for classic Bloom filters and
55+
-- @2^41@ for blocked Bloom filters.
56+
--
57+
-- * The 'Bloom' and 'MBloom' types are parametrised over a 'Hashable' type
58+
-- class, instead of having a @a -> ['Hash']@ typed field.
59+
-- This separation allows clean (de-)serialisation of Bloom filters in this
60+
-- package, as the hashing scheme is static.
61+
--
62+
-- * [@XXH3@ hash](https://xxhash.com/) is used instead of [Jenkins'
63+
-- @lookup3@](https://en.wikipedia.org/wiki/Jenkins_hash_function#lookup3).

bloomfilter-blocked/src/Data/BloomFilter/Blocked.hs

Lines changed: 67 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,17 @@
1-
-- |
1+
-- | A fast, space efficient Bloom filter implementation. A Bloom filter is a
2+
-- set-like data structure that provides a probabilistic membership test.
23
--
3-
-- A fast, space efficient Bloom filter implementation. A Bloom
4-
-- filter is a set-like data structure that provides a probabilistic
5-
-- membership test.
4+
-- * Queries do not give false negatives. When an element is added to a filter,
5+
-- a subsequent membership test will definitely return 'True'.
66
--
7-
-- * Queries do not give false negatives. When an element is added to
8-
-- a filter, a subsequent membership test will definitely return
9-
-- 'True'.
7+
-- * False positives /are/ possible. If an element has not been added to a
8+
-- filter, a membership test /may/ nevertheless indicate that the element is
9+
-- present.
1010
--
11-
-- * False positives /are/ possible. If an element has not been added
12-
-- to a filter, a membership test /may/ nevertheless indicate that
13-
-- the element is present.
14-
--
15-
1611
module Data.BloomFilter.Blocked (
12+
-- * Overview
13+
-- $overview
14+
1715
-- * Types
1816
Hash,
1917
Salt,
@@ -57,6 +55,7 @@ module Data.BloomFilter.Blocked (
5755
maxSizeBits,
5856
insert,
5957
insertMany,
58+
read,
6059

6160
-- ** Conversion
6261
freeze,
@@ -68,6 +67,7 @@ module Data.BloomFilter.Blocked (
6867
hashesWithSalt,
6968
insertHashes,
7069
elemHashes,
70+
readHashes,
7171
-- ** Prefetching
7272
prefetchInsert,
7373
prefetchElem,
@@ -80,23 +80,60 @@ import Data.Bits ((.&.))
8080
import Data.Primitive.ByteArray (MutableByteArray)
8181
import qualified Data.Primitive.PrimArray as P
8282

83-
import Data.BloomFilter.Blocked.Calc
83+
import Data.BloomFilter.Blocked.Calc (BitsPerEntry, BloomPolicy (..),
84+
BloomSize (..), FPR, NumEntries, policyFPR, policyForBits,
85+
policyForFPR, sizeForBits, sizeForFPR, sizeForPolicy)
8486
import Data.BloomFilter.Blocked.Internal hiding (deserialise)
8587
import qualified Data.BloomFilter.Blocked.Internal as Internal
8688
import Data.BloomFilter.Hash
8789

88-
import Prelude hiding (elem, notElem)
90+
import Prelude hiding (elem, notElem, read)
91+
92+
-- $setup
93+
--
94+
-- >>> import Text.Printf
95+
96+
-- $overview
97+
--
98+
-- Each of the functions for creating Bloom filters accepts a 'BloomSize'. The
99+
-- size determines the number of bits that should be used for the filter. Note
100+
-- that a filter is fixed in size; it cannot be resized after creation.
101+
--
102+
-- The size can be specified by asking for a target false positive rate (FPR)
103+
-- or a number of bits per element, and the number of elements in the filter.
104+
-- For example:
105+
--
106+
-- * @'sizeForFPR' 1e-3 10_000@ for a Bloom filter sized for 10,000 elements
107+
-- with a false positive rate of 1 in 1000
108+
--
109+
-- * @'sizeForBits' 10 10_000@ for a Bloom filter sized for 10,000 elements
110+
-- with 10 bits per element
111+
--
112+
-- Depending on the application it may be more important to target a fixed
113+
-- amount of memory to use, or target a specific FPR.
114+
--
115+
-- As a very rough guide for filter sizes, here are a range of FPRs and bits
116+
-- per element:
117+
--
118+
-- * FPR of 1e-1 requires approximately 4.8 bits per element
119+
-- * FPR of 1e-2 requires approximately 9.8 bits per element
120+
-- * FPR of 1e-3 requires approximately 15.8 bits per element
121+
-- * FPR of 1e-4 requires approximately 22.6 bits per element
122+
-- * FPR of 1e-5 requires approximately 30.2 bits per element
123+
--
124+
-- >>> fmap (printf "%0.1f" . policyBits . policyForFPR) [1e-1, 1e-2, 1e-3, 1e-4, 1e-5] :: [String]
125+
-- ["4.8","9.8","15.8","22.6","30.2"]
89126

90127
-- | Create an immutable Bloom filter, using the given setup function
91128
-- which executes in the 'ST' monad.
92129
--
93130
-- Example:
94131
--
95-
-- @
132+
-- >>> :{
96133
-- filter = create (sizeForBits 16 2) 4 $ \mf -> do
97-
-- insert mf \"foo\"
98-
-- insert mf \"bar\"
99-
-- @
134+
-- insert mf "foo"
135+
-- insert mf "bar"
136+
-- :}
100137
--
101138
-- Note that the result of the setup function is not used.
102139
create :: BloomSize
@@ -141,6 +178,12 @@ elem = \ !x !b -> elemHashes b (hashesWithSalt (hashSalt b) x)
141178
notElem :: Hashable a => a -> Bloom a -> Bool
142179
notElem = \x b -> not (x `elem` b)
143180

181+
-- | Query a mutable Bloom filter for membership. If the value is
182+
-- present, return @True@. If the value is not present, there is
183+
-- /still/ some possibility that @True@ will be returned.
184+
read :: Hashable a => MBloom s a -> a -> ST s Bool
185+
read !mb !x = readHashes mb (hashesWithSalt (mbHashSalt mb) x)
186+
144187
-- | Build an immutable Bloom filter from a seed value. The seeding
145188
-- function populates the filter as follows.
146189
--
@@ -168,6 +211,7 @@ unfold bloomsize bloomsalt f k =
168211
Nothing -> pure ()
169212
Just (a, j') -> insert mb a >> loop j'
170213

214+
{-# INLINEABLE fromList #-}
171215
-- | Create a Bloom filter, populating it from a sequence of values.
172216
--
173217
-- For example
@@ -185,10 +229,11 @@ fromList policy bloomsalt xs =
185229
where
186230
bsize = sizeForPolicy policy (length xs)
187231

188-
{-# SPECIALISE deserialise :: BloomSize
189-
-> Salt
190-
-> (MutableByteArray RealWorld -> Int -> Int -> IO ())
191-
-> IO (Bloom a) #-}
232+
{-# SPECIALISE deserialise ::
233+
BloomSize
234+
-> Salt
235+
-> (MutableByteArray RealWorld -> Int -> Int -> IO ())
236+
-> IO (Bloom a) #-}
192237
deserialise :: PrimMonad m
193238
=> BloomSize
194239
-> Salt

bloomfilter-blocked/src/Data/BloomFilter/Blocked/BitArray.hs

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ module Data.BloomFilter.Blocked.BitArray (
1818
new,
1919
unsafeSet,
2020
prefetchSet,
21+
unsafeRead,
2122
freeze,
2223
unsafeFreeze,
2324
thaw,
@@ -155,6 +156,17 @@ prefetchSet (MBitArray (MutablePrimArray mba#)) (BlockIx blockIx) = do
155156
ST (\s -> case prefetchMutableByteArray0# mba# i# s of
156157
s' -> (# s', () #))
157158

159+
unsafeRead :: MBitArray s -> BlockIx -> BitIx -> ST s Bool
160+
unsafeRead (MBitArray arr) blockIx blockBitIx = do
161+
#ifdef NO_IGNORE_ASSERTS
162+
sz <- getSizeofMutablePrimArray arr
163+
assert (wordIx >= 0 && wordIx < sz) $ pure ()
164+
#endif
165+
w <- readPrimArray arr wordIx
166+
pure $ unsafeTestBit w wordBitIx
167+
where
168+
(wordIx, wordBitIx) = wordAndBitIndex blockIx blockBitIx
169+
158170
freeze :: MBitArray s -> ST s BitArray
159171
freeze (MBitArray arr) = do
160172
len <- getSizeofMutablePrimArray arr

bloomfilter-blocked/src/Data/BloomFilter/Blocked/Calc.hs

Lines changed: 40 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,8 +13,7 @@ module Data.BloomFilter.Blocked.Calc (
1313
policyForBits,
1414
) where
1515

16-
import Data.BloomFilter.Classic.Calc (BitsPerEntry, BloomPolicy (..),
17-
BloomSize (..), FPR, NumEntries)
16+
import Data.BloomFilter.Classic.Calc (BitsPerEntry, FPR, NumEntries)
1817

1918
{-
2019
Calculating the relationship between bits and FPR for the blocked
@@ -49,6 +48,32 @@ Fit {
4948
5049
-}
5150

51+
-- | A policy on intended bloom filter size -- independent of the number of
52+
-- elements.
53+
--
54+
-- We can decide a policy based on:
55+
--
56+
-- 1. a target false positive rate (FPR) using 'policyForFPR'
57+
-- 2. a number of bits per entry using 'policyForBits'
58+
--
59+
-- A policy can be turned into a 'BloomSize' given a target 'NumEntries' using
60+
-- 'sizeForPolicy'.
61+
--
62+
-- Either way we define the policy, we can inspect the result to see:
63+
--
64+
-- 1. The bits per entry 'policyBits'. This will determine the
65+
-- size of the bloom filter in bits. In general the bits per entry can be
66+
-- fractional. The final bloom filter size in will be rounded to a whole
67+
-- number of bits.
68+
-- 2. The number of hashes 'policyHashes'.
69+
-- 3. The expected FPR for the policy using 'policyFPR'.
70+
--
71+
data BloomPolicy = BloomPolicy {
72+
policyBits :: !Double,
73+
policyHashes :: !Int
74+
}
75+
deriving stock Show
76+
5277
policyForFPR :: FPR -> BloomPolicy
5378
policyForFPR fpr | fpr <= 0 || fpr >= 1 =
5479
error "bloomPolicyForFPR: fpr out of range (0,1)"
@@ -103,6 +128,19 @@ policyFPR BloomPolicy {
103128
f1 = 0.5251544487138062
104129
f0 = -0.10110451821280719
105130

131+
-- | Parameters for constructing a Bloom filter.
132+
--
133+
data BloomSize = BloomSize {
134+
-- | The requested number of bits in the filter.
135+
--
136+
-- The actual size will be rounded up to the nearest 512.
137+
sizeBits :: !Int,
138+
139+
-- | The number of hash functions to use.
140+
sizeHashes :: !Int
141+
}
142+
deriving stock Show
143+
106144
sizeForFPR :: FPR -> NumEntries -> BloomSize
107145
sizeForFPR = sizeForPolicy . policyForFPR
108146

0 commit comments

Comments
 (0)