HLL++ in sparse mode can be larger than in normal mode

Today I played around with HLL++ in sparse mode. I have tens of millions of HLL estimators, and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be precise, I don't care that much about memory consumption; I am trying to minimize the serialized size of the estimators.

The whole idea behind sparse mode is to avoid wasting memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch to the normal representation as soon as the sparse representation consumes more memory.
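
As a minimal sketch, the expected rule could be written as a comparison of serialized sizes. The class and method names below are hypothetical and nothing here calls stream-lib; the byte counts are the ones measured further down in this issue.

```java
/** Hypothetical helper, not part of stream-lib: a size-based switch rule. */
final class SizeBasedSwitch {

    /** True once the serialized sparse form is no smaller than the normal form. */
    static boolean shouldSwitchToNormal(int sparseBytes, int normalBytes) {
        return sparseBytes >= normalBytes;
    }

    public static void main(String[] args) {
        int normalBytes = 10940; // HyperLogLogPlus(14) in normal mode, as measured in this issue
        // Empty sparse estimator (16 bytes): should stay sparse.
        System.out.println(shouldSwitchToNormal(16, normalBytes));
        // Sparse estimator after 5K offers (25495 bytes): should already have switched.
        System.out.println(shouldSwitchToNormal(25495, normalBytes));
    }
}
```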

However, I don't observe such behavior with HyperLogLogPlus:

HyperLogLog:

```
HyperLogLog hll = new HyperLogLog(14);
System.out.println(hll.getBytes().length);
// => 10932
```

HyperLogLogPlus in normal mode:

```
HyperLogLogPlus hllp = new HyperLogLogPlus(14);
System.out.println(hllp.getBytes().length);
// => 10940
```

Empty HyperLogLogPlus in sparse mode:

```
HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
System.out.println(hllp.getBytes().length);
// => 16
```

5K elements with HyperLogLogPlus in sparse mode:

```
Random r = new Random();

HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);

// Offer 5000 random integers, as strings, to the sparse estimator.
for (int i = 0; i < 5000; i++) {
    hllp.offer(Integer.toString(r.nextInt()));
}

System.out.println(hllp.getBytes().length);
// => 25495
```

According to the source code, `sparseSetThreshold` depends only on `p` and is set to 12288 for `p = 14`. This means that if the set contains 12000 elements, I'm wasting almost 40 KB compared to the normal representation.
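
To put a rough number on where sparse mode stops paying off, the measurements above can be extrapolated. This is only a back-of-the-envelope sketch: it assumes the serialized sparse size grows linearly with the number of offered elements, which may not hold exactly given how the sparse list is encoded. The class and method names are hypothetical, and nothing here calls stream-lib.

```java
/** Back-of-the-envelope estimate; names are hypothetical, not part of stream-lib. */
final class SparseBreakEven {

    /** Average serialized bytes added per element, derived from one measurement. */
    static double bytesPerElement(int emptySparseBytes, int measuredBytes, int elements) {
        return (measuredBytes - emptySparseBytes) / (double) elements;
    }

    /** Element count at which the sparse form reaches the normal size, assuming linear growth. */
    static long breakEvenElements(int emptySparseBytes, double perElement, int normalBytes) {
        return (long) Math.ceil((normalBytes - emptySparseBytes) / perElement);
    }

    public static void main(String[] args) {
        int emptySparse = 16;      // empty HyperLogLogPlus(14, 14)
        int sparseAt5k = 25495;    // after 5000 offers
        int normal = 10940;        // HyperLogLogPlus(14) in normal mode

        double perElement = bytesPerElement(emptySparse, sparseAt5k, 5000);
        System.out.println("bytes per element: " + perElement);
        System.out.println("break-even elements: " + breakEvenElements(emptySparse, perElement, normal));
    }
}
```

With the figures from this issue, this works out to roughly 5.1 bytes per element and a break-even near 2,144 elements, far below the 12288 threshold.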

Am I wrong? Is this behavior expected?

My second question is: do we really want to create the `RegisterSet` even when we are in sparse mode? It defeats the goal of being memory efficient. It does not currently matter for me, since my bottleneck is the serialized size, but I find this quite surprising.

