Skip to content

Incidence Bitmaps

Eyal Rozenberg edited this page May 3, 2017 · 1 revision

Overview

The Incidence Bitmaps compression scheme is intended for data with a very small support: Just a few distinct values. For element of the column we keep as many bits as there are potential values; and the bit which is turned on (has value 1 rather than 0) is the value of that column element.

This scheme's shorthand designator is BITMAP.

Scheme parameters

Parameter Aliases Symbol used
Domain element type Uncompressed type, uncompressed data type τ_D
Domain size Number of possible values, support size, number of bitmaps k
Domain size Number of possible values, support size k
Bit container type τ_c
Column length Single bitmap length, uncompressed data length n

Variants

Arbitrary domain?

  • If the domain (the set of possible uncompressed values) is exactly 0..k-1, then the index of the bitmap with the on-bit is also the value of the column element. This is the simpler decompression.
  • For an arbitrary (but fixed-width) domain of values, we use a dictionary - similar to the one in the Dictionary compression scheme, but hard-fused into the implementation of the BITMAP scheme. The decomrpession will not write the bitmap index i, but rather look up and write dictionary[i], to the output

Infer last domain value bit? (Unsupported)

For a domain of size k, one only needs incidence bits for k-1 of the values, as those determine with certainty the value of the last bit. This is especially relevant when k=3 or k=2, in which one could save a full third or half of the storage space.

  • For k values, store all k bitmapsd - this variant is what's implemented
  • For k values, store the first k-1 bitmaps - this is not supported

Memory layout

  • The bits of each bitmap are packed into elements of type τ_c - "bit containers" - arranged within a container from the least to the most significant bit.
    • The last bit container may have some 'slack' bits if the bitmap length is not a multiple of the size-in-bits of τ_c; their value should be ignored.
  • The bitmaps are stored consecutively in memory, and all passed to the kernel using a single address.
  • Beginnings of bitmaps are aligned to a larger size than |τ_c|, typically to at least 32 bytes. There may be thus a gap between the last container of one bitmap and the first container of the next one.
  • For the arbitrary-domain variant: The dictionary entries are passed as their own array, with elements of type τ_D. There is no slack and no alignment gaps.

Links to the code

Notes

  • If we were to "weave together" the bitmaps, we would get a unary index into the dictionary. This illustrates how the BITMAP quickly becomes inefficient as the domain size increases - using about n*k / 8 bytes of storage instead of n * log(k)/8 with bit-resolution DICT and at most n * (1 + log(k)/8) with byte-resolution DICT.
  • Despite the inefficiency, for small values of k, this scheme allows for reasonable compression ratio, while making filtering or equi-joining with a constant very simple: One simply uses the appropriate single bitmap as the result.