Incidence Bitmaps

Overview

The Incidence Bitmaps compression scheme is intended for data with a very small support: Just a few distinct values. For element of the column we keep as many bits as there are potential values; and the bit which is turned on (has value 1 rather than 0) is the value of that column element.

This scheme's shorthand designator is BITMAP.

Scheme parameters

Parameter	Aliases	Symbol used
Domain element type	Uncompressed type, uncompressed data type	τ_D
Domain size	Number of possible values, support size, number of bitmaps	k
Domain size	Number of possible values, support size	k
Bit container type		τ_c
Column length	Single bitmap length, uncompressed data length	n

Variants

Arbitrary domain?

If the domain (the set of possible uncompressed values) is exactly 0..k-1, then the index of the bitmap with the on-bit is also the value of the column element. This is the simpler decompression.
For an arbitrary (but fixed-width) domain of values, we use a dictionary - similar to the one in the Dictionary compression scheme, but hard-fused into the implementation of the BITMAP scheme. The decomrpession will not write the bitmap index i, but rather look up and write dictionary[i], to the output

Infer last domain value bit? (Unsupported)

For a domain of size k, one only needs incidence bits for k-1 of the values, as those determine with certainty the value of the last bit. This is especially relevant when k=3 or k=2, in which one could save a full third or half of the storage space.

For k values, store all k bitmapsd - this variant is what's implemented
For k values, store the first k-1 bitmaps - this is not supported

Memory layout

The bits of each bitmap are packed into elements of type τ_c - "bit containers" - arranged within a container from the least to the most significant bit.
- The last bit container may have some 'slack' bits if the bitmap length is not a multiple of the size-in-bits of τ_c; their value should be ignored.
The bitmaps are stored consecutively in memory, and all passed to the kernel using a single address.
Beginnings of bitmaps are aligned to a larger size than |τ_c|, typically to at least 32 bytes. There may be thus a gap between the last container of one bitmap and the first container of the next one.
For the arbitrary-domain variant: The dictionary entries are passed as their own array, with elements of type τ_D. There is no slack and no alignment gaps.

Links to the code

Notes

If we were to "weave together" the bitmaps, we would get a unary index into the dictionary. This illustrates how the BITMAP quickly becomes inefficient as the domain size increases - using about n*k / 8 bytes of storage instead of n * log(k)/8 with bit-resolution DICT and at most n * (1 + log(k)/8) with byte-resolution DICT.
Despite the inefficiency, for small values of k, this scheme allows for reasonable compression ratio, while making filtering or equi-joining with a constant very simple: One simply uses the appropriate single bitmap as the result.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly