-
Notifications
You must be signed in to change notification settings - Fork 9
Incidence Bitmaps
Eyal Rozenberg edited this page May 3, 2017
·
1 revision
The Incidence Bitmaps compression scheme is intended for data with a very small support: Just a few distinct values. For element of the column we keep as many bits as there are potential values; and the bit which is turned on (has value 1 rather than 0) is the value of that column element.
This scheme's shorthand designator is BITMAP
.
Parameter | Aliases | Symbol used |
---|---|---|
Domain element type | Uncompressed type, uncompressed data type | τ_D |
Domain size | Number of possible values, support size, number of bitmaps | k |
Domain size | Number of possible values, support size | k |
Bit container type | τ_c | |
Column length | Single bitmap length, uncompressed data length | n |
- If the domain (the set of possible uncompressed values) is exactly 0..k-1, then the index of the bitmap with the on-bit is also the value of the column element. This is the simpler decompression.
- For an arbitrary (but fixed-width) domain of values, we use a dictionary - similar to the one in the Dictionary compression scheme, but hard-fused into the implementation of the BITMAP scheme. The decomrpession will not write the bitmap index
i
, but rather look up and writedictionary[i]
, to the output
For a domain of size k, one only needs incidence bits for k-1 of the values, as those determine with certainty the value of the last bit. This is especially relevant when k=3 or k=2, in which one could save a full third or half of the storage space.
- For k values, store all k bitmapsd - this variant is what's implemented
- For k values, store the first k-1 bitmaps - this is not supported
- The bits of each bitmap are packed into elements of type τ_c - "bit containers" - arranged within a container from the least to the most significant bit.
- The last bit container may have some 'slack' bits if the bitmap length is not a multiple of the size-in-bits of τ_c; their value should be ignored.
- The bitmaps are stored consecutively in memory, and all passed to the kernel using a single address.
- Beginnings of bitmaps are aligned to a larger size than |τ_c|, typically to at least 32 bytes. There may be thus a gap between the last container of one bitmap and the first container of the next one.
- For the arbitrary-domain variant: The dictionary entries are passed as their own array, with elements of type τ_D. There is no slack and no alignment gaps.
- If we were to "weave together" the bitmaps, we would get a unary index into the dictionary. This illustrates how the BITMAP quickly becomes inefficient as the domain size increases - using about n*k / 8 bytes of storage instead of n * log(k)/8 with bit-resolution DICT and at most n * (1 + log(k)/8) with byte-resolution DICT.
- Despite the inefficiency, for small values of k, this scheme allows for reasonable compression ratio, while making filtering or equi-joining with a constant very simple: One simply uses the appropriate single bitmap as the result.