HLL++ in sparse mode can be larger than in normal mode #38
Description
Today I played a little bit with HLL++ in sparse mode. I have tens of millions of HLL estimators, and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be precise, I don't care that much about memory consumption; I am trying to minimize the serialized size of the estimators.
The whole idea behind sparse mode is to avoid wasting memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch to the normal representation as soon as the sparse representation consumes more memory.
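The rule proposed here can be written down as a minimal sketch. The class, enum, and method names are hypothetical and not part of HyperLogLogPlus; the byte counts in `main` are the serialized sizes measured in the examples below.

```java
public class RepresentationChoice {
    enum Format { SPARSE, NORMAL }

    // Hypothetical rule: keep the sparse form only while it is strictly
    // smaller than the normal form; otherwise fall back to normal.
    static Format smallest(int sparseBytes, int normalBytes) {
        return sparseBytes < normalBytes ? Format.SPARSE : Format.NORMAL;
    }

    public static void main(String[] args) {
        // Empty sparse estimator versus the normal form at p = 14.
        System.out.println(smallest(16, 10940));
        // Sparse estimator after 5000 offers versus the normal form at p = 14.
        System.out.println(smallest(25495, 10940));
    }
}
```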
However, I don't observe such behavior with HyperLogLogPlus:
HyperLogLog:
HyperLogLog hll = new HyperLogLog(14);
System.out.println(hll.getBytes().length);
=> 10932
HyperLogLogPlus in normal mode
HyperLogLogPlus hllp = new HyperLogLogPlus(14);
System.out.println(hllp.getBytes().length);
=> 10940
Empty HyperLogLogPlus in sparse mode
HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
System.out.println(hllp.getBytes().length);
=> 16
5K elements with HyperLogLogPlus in sparse mode
Random r = new Random();
HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
for (int i = 0; i < 5000; i++) {
    hllp.offer(Integer.toString(r.nextInt()));
}
System.out.println(hllp.getBytes().length);
=> 25495
According to the source code, sparseSetThreshold depends only on p and is set to 12288 for p = 14. This means that if the set contains 12000 elements, I'm wasting almost 40 KB compared to the normal representation.
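To make the size argument concrete, here is a rough back-of-the-envelope sketch that uses only the measurements quoted above. It assumes the serialized sparse size grows linearly with the number of elements, which is an assumption rather than something confirmed by the library, so the results are estimates only. The class and method names are hypothetical. As a side observation, 12288 equals 0.75 × 2^14, which is consistent with a threshold proportional to the register count.

```java
public class SparseSizeEstimate {
    // Measurements quoted in the report (p = 14).
    static final int NORMAL_BYTES = 10940;       // HyperLogLogPlus in normal mode
    static final int SPARSE_BYTES_AT_5K = 25495; // sparse mode after 5000 offers
    static final int SPARSE_ELEMENTS = 5000;

    // Assumption: sparse serialized size scales linearly with element count.
    static double bytesPerElement() {
        return (double) SPARSE_BYTES_AT_5K / SPARSE_ELEMENTS;
    }

    // Estimated sparse serialized size for a given number of elements.
    static long estimatedSparseBytes(int elements) {
        return Math.round(elements * bytesPerElement());
    }

    // Largest element count for which the estimated sparse size
    // does not exceed the normal size.
    static int breakEvenElements() {
        return (int) Math.floor(NORMAL_BYTES / bytesPerElement());
    }

    public static void main(String[] args) {
        int threshold = 12288; // sparseSetThreshold for p = 14, as stated above
        System.out.println("bytes per sparse element: " + bytesPerElement());
        System.out.println("break-even element count: " + breakEvenElements());
        System.out.println("estimated sparse bytes at threshold: "
                + estimatedSparseBytes(threshold));
        System.out.println("estimated excess over normal at threshold: "
                + (estimatedSparseBytes(threshold) - NORMAL_BYTES));
    }
}
```

Under this linear assumption, the sparse form stops paying off at a few thousand elements, well before the threshold is reached.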
Am I wrong? Is this behavior expected?
My second question is: do we really want to create the RegisterSet even when we are in sparse mode? It defeats the goal of being memory efficient. It currently does not matter for me, since my bottleneck is the serialized size, but I find it quite surprising.
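One way the second point could be addressed is to allocate the registers lazily, only at the moment of conversion to normal mode. The sketch below is a simplified, hypothetical illustration and not stream-lib's implementation: it stores raw 32-bit hashes in a `TreeSet` as a stand-in for the sparse set, uses one byte per register instead of a packed RegisterSet, and takes the threshold as a constructor argument. It assumes p is between 1 and 31.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of lazy register allocation; not HyperLogLogPlus itself. */
public class LazyRegisterSketch {
    private final int p;                 // precision: 2^p registers in normal mode
    private final int sparseThreshold;   // convert once the sparse set exceeds this size
    private Set<Integer> sparse = new TreeSet<>(); // stand-in for the sparse set
    private byte[] registers;            // stays null while in sparse mode

    public LazyRegisterSketch(int p, int sparseThreshold) {
        this.p = p;
        this.sparseThreshold = sparseThreshold;
    }

    public boolean registersAllocated() {
        return registers != null;
    }

    /** Offers an already-hashed 32-bit value. */
    public void offerHashed(int hash) {
        if (registers == null) {
            sparse.add(hash);
            if (sparse.size() > sparseThreshold) {
                convertToNormal();
            }
        } else {
            updateRegister(hash);
        }
    }

    // Registers are allocated here, and only here.
    private void convertToNormal() {
        registers = new byte[1 << p];
        for (int h : sparse) {
            updateRegister(h);
        }
        sparse = null; // release the sparse set after conversion
    }

    private void updateRegister(int hash) {
        int index = hash >>> (32 - p);   // top p bits select the register
        int rest = hash << p;            // remaining bits, left-aligned
        int rank = Math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    public static void main(String[] args) {
        LazyRegisterSketch s = new LazyRegisterSketch(14, 3);
        for (int h = 1; h <= 3; h++) {
            s.offerHashed(h);
        }
        System.out.println("allocated after 3 offers: " + s.registersAllocated());
        s.offerHashed(4);
        System.out.println("allocated after 4 offers: " + s.registersAllocated());
    }
}
```

With this layout, an estimator that never leaves sparse mode never pays for the register array.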