
Frequently asked questions

Table of contents
- How can I reproduce the experimental results in your paper?
- Why doesn't Giddy support compression scheme [X]?
- Are the performance figures you present in your paper supposed to be optimal for these schemes?
- In your published results, why is the speedup using scheme [Y] so poor?
- I know how to transmit, then decompress. How do I transmit and decompress simultaneously?

(The questions are answered in the first person while I - Eyal - am the only developer/maintainer.)

How can I reproduce the experimental results in your paper?

Ah, good question. For now here's a sketchy description; later on I'll flesh things out.

You will be using the following software packages/repositories:

  1. MonetDB Jun2016 SP2 (that's a link to the sources; there are also binary packages for various operating systems). It has to be this version - see the note below.
  2. My TPCH benchmark tools.
  3. My USDT-Ontime benchmark tools.
  4. My DB kernel testbench - which also includes the Giddy library kernels. (Yes, it's on BitBucket.)

After (building and) installing MonetDB, use the setup scripts in both of the benchmark tool packages to generate/download and load the data sets; the main scripts are scripts/setup-tpch-db and scripts/setup-usdt-ontime-db respectively (run them with --help first). Then you'll need to build the kernel testbench; run it with --help to list the command-line options, and use the options which tell it to run one of the decompression kernels and take its input from one of the two databases you set up. That, however, will only run the kernel - it won't give you timing results. For those you'll need the scripts/time_column_decompression script: it takes its own options, then a --, then the tester binary's options, and produces a 2-line CSV (header + data) with timing information.
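
As a rough outline - with angle-bracketed placeholders standing in for the options you'll need to look up via each tool's --help, and for whatever the testbench build names its tester binary - the command sequence is along these lines:

```
# 1. Build and install MonetDB Jun2016 SP2

# 2. Set up the benchmark databases (check the available options first)
scripts/setup-tpch-db --help
scripts/setup-tpch-db <options>
scripts/setup-usdt-ontime-db --help
scripts/setup-usdt-ontime-db <options>

# 3. Build the kernel testbench, then list its options
<tester-binary> --help

# 4. Run a single decompression kernel on a column from one of the databases
<tester-binary> <options selecting a decompression kernel and its input>

# 5. Get timing results: the timing script's own options, then '--', then the tester options
scripts/time_column_decompression <timing options> -- <tester options>
```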

Again, I realize this is rather sketchy. Please contact me if you get stuck in any of the steps I described.

Note: You can't use another version of MonetDB, since my kernel testbench (or rather, its MonetDB persisted storage reader library) only supports that version, for now.

Why doesn't Giddy support compression scheme [X]?

The possible reasons are:

  1. I haven't had the time to implement X and nobody has asked me to prioritize it.
  2. I've never heard of X before (I'm not a compression researcher; this is a tool for me to do other things).
  3. I've reached the conclusion X would not be useful on a GPU, i.e. it would be slow to decompress.
  4. It's not clear to me (at the moment) that X would provide a significant benefit considering the schemes already implemented.

and they are in decreasing order of likelihood!

Are the performance figures you present in your paper supposed to be optimal for these schemes?

Generally speaking, no.

For some schemes, the speed is not very far (in multiplicative terms) from the theoretical optimum (see the question about that), and I'm not sure whether they can practically be improved much further (though I could be wrong).

For other schemes - including but not limited to RLE, RPE and BITMAP - I'm sure that further improvement is possible, for many or even all input distributions.

In your published results, why is the speedup using scheme [Y] so poor?

It's either a known issue which has not been resolved yet (a bug or an inefficiency of some kind), or it's the GPU hardware which limits the decompression speed.

If you're looking at performance in terms of total time for transmission plus decompression, remember that the compression ratio has a large effect on that total, regardless of whether decompression itself is fast or slow.

I know how to transmit, then decompress. How do I transmit and decompress simultaneously?

That is not a capability the library itself provides. GPUs do not support a proper pipeline in which data coming in over the PCIe bus goes straight into the processing cores rather than through global device memory; so the point is to transmit some chunks or segments of the data while decompressing others.

So, what you do is one of the following (or both of them):

  • Schedule the transfer of multiple compressed columns on one queue (= CUDA stream), and their decompression on another queue, with an event dependency between the end of each column's transmission and the beginning of its decompression (a minimal sketch of this pattern follows this list).
  • Break up your columns into sequences of contiguous segments, and schedule transmission-then-decompression of such sequences on multiple queues. This way, while one queue is waiting for the decompression of a chunk to conclude, the GPU can schedule further transfers on another queue. The chunks should not be too large - the larger they are, the longer the initial latency before the first chunk gets through and decompression can begin - but also not too small, since as chunk size decreases, the per-kernel-launch overhead becomes more significant.
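
To make the event-dependency pattern concrete, here is a minimal CUDA sketch. It is not Giddy API code: the decompress_chunk kernel, the chunk layout and the launch configuration are illustrative placeholders for whichever Giddy kernel and parameters you actually use.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for a real decompression kernel (a placeholder, not Giddy API):
// it just zero-fills the output so the sketch compiles and runs on its own.
__global__ void decompress_chunk(const unsigned char* compressed,
                                 int* decompressed, size_t num_elements)
{
    (void) compressed; // a real kernel would decode this buffer
    size_t stride = (size_t) gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < num_elements; i += stride) {
        decompressed[i] = 0;
    }
}

// Copy compressed chunks on one stream and decompress them on another stream,
// with a per-chunk event so each decompression waits only for its own chunk's
// transfer - letting copies and kernels of different chunks overlap.
void transmit_and_decompress(const unsigned char* host_compressed, // pinned host memory
                             unsigned char* device_compressed,
                             int* device_decompressed,
                             size_t compressed_chunk_size,
                             size_t elements_per_chunk,
                             int num_chunks)
{
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    for (int i = 0; i < num_chunks; i++) {
        cudaEvent_t chunk_transferred;
        cudaEventCreateWithFlags(&chunk_transferred, cudaEventDisableTiming);

        // Transfer chunk i on the copy queue...
        cudaMemcpyAsync(device_compressed + i * compressed_chunk_size,
                        host_compressed   + i * compressed_chunk_size,
                        compressed_chunk_size, cudaMemcpyHostToDevice, copy_stream);
        cudaEventRecord(chunk_transferred, copy_stream);

        // ...and make the compute queue wait only for chunk i's transfer
        cudaStreamWaitEvent(compute_stream, chunk_transferred, 0);
        decompress_chunk<<<256, 256, 0, compute_stream>>>(
            device_compressed   + i * compressed_chunk_size,
            device_decompressed + i * elements_per_chunk,
            elements_per_chunk);

        cudaEventDestroy(chunk_transferred); // released once the event completes
    }

    cudaStreamSynchronize(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}

int main()
{
    const int    num_chunks = 8;
    const size_t compressed_chunk_size = 1 << 20; // 1 MiB of compressed data per chunk (arbitrary)
    const size_t elements_per_chunk    = 1 << 19; // assumed decompressed elements per chunk

    unsigned char* host_compressed;
    unsigned char* device_compressed;
    int*           device_decompressed;
    cudaMallocHost(&host_compressed, num_chunks * compressed_chunk_size); // pinned, for async copies
    cudaMalloc(&device_compressed,   num_chunks * compressed_chunk_size);
    cudaMalloc(&device_decompressed, num_chunks * elements_per_chunk * sizeof(int));

    transmit_and_decompress(host_compressed, device_compressed, device_decompressed,
                            compressed_chunk_size, elements_per_chunk, num_chunks);

    printf("done\n");
    cudaFreeHost(host_compressed);
    cudaFree(device_compressed);
    cudaFree(device_decompressed);
    return 0;
}
```

For the second approach, you would run several such copy/compute queue pairs (or several queues each doing transmit-then-decompress on its own sequence of segments), so that transfers and kernels belonging to different queues can overlap.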