This repository was archived by the owner on Jul 7, 2020. It is now read-only.

HLL++ in sparse mode can be larger than in normal mode #38

@cykl

Today I played a little bit with HLL++ in sparse mode. I have tens of millions of HLL estimators, and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be precise, I don't care that much about memory consumption; I am trying to minimize the serialized size of the estimators.

The whole idea behind the sparse mode is to avoid wasting memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch back to the normal representation as soon as the sparse mode consumes more memory.

However, I don't observe such behavior with HyperLogLogPlus:

HyperLogLog:

        HyperLogLog hll = new HyperLogLog(14);
        System.out.println(hll.getBytes().length);

=> 10932

HyperLogLogPlus in normal mode:

        HyperLogLogPlus hllp = new HyperLogLogPlus(14);
        System.out.println(hllp.getBytes().length);

=> 10940 

Empty HyperLogLogPlus in sparse mode:

        HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
        System.out.println(hllp.getBytes().length);

=> 16

5K elements with HyperLogLogPlus in sparse mode:

        Random r = new Random();

        HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);

        for (int i = 0; i < 5000; i++) {
            hllp.offer(Integer.toString(r.nextInt()));
        }

        System.out.println(hllp.getBytes().length);

=> 25495

According to the source code, sparseSetThreshold depends only on p and is set to 12288 for p = 14. This means that if the set contains 12000 elements, I'm wasting almost 40 KBytes compared to the normal representation.

Am I wrong? Is this behavior expected?
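To make the arithmetic above concrete, here is a rough sketch of the estimate. It does not call the library. It assumes the threshold is 75% of the 2^p registers (which matches the 12288 reported for p = 14) and that the sparse serialization grows linearly with the number of entries. The class name, the method names, and the 4-bytes-per-entry figure are assumptions made for illustration only; the measurement above suggests the real cost per entry is closer to 5 bytes.

```java
// Back-of-the-envelope estimate; none of these names come from the library.
class SparseBreakEven {
    // Serialized size of HyperLogLogPlus in normal mode for p = 14, as measured above.
    static final int NORMAL_BYTES = 10940;

    // Assumption: the threshold is 75% of the register count m = 2^p.
    static int sparseSetThreshold(int p) {
        return (int) ((1 << p) * 0.75);
    }

    // Assumption: sparse size grows linearly with the number of entries.
    static long estimatedSparseBytes(int entries, double bytesPerEntry) {
        return Math.round(entries * bytesPerEntry);
    }

    // Smallest entry count whose estimated sparse size exceeds the normal size.
    static int breakEvenEntries(double bytesPerEntry) {
        return (int) Math.floor(NORMAL_BYTES / bytesPerEntry) + 1;
    }

    public static void main(String[] args) {
        double assumed = 4.0;              // hypothetical: one int per sparse entry
        double observed = 25495.0 / 5000;  // bytes per entry from the 5K measurement

        System.out.println("threshold for p = 14: " + sparseSetThreshold(14));
        System.out.println("estimated waste at 12000 entries: "
                + (estimatedSparseBytes(12000, assumed) - NORMAL_BYTES) + " bytes");
        System.out.println("break-even (assumed cost): " + breakEvenEntries(assumed));
        System.out.println("break-even (observed cost): " + breakEvenEntries(observed));
    }
}
```

Under either per-entry cost, the break-even point sits far below the 12288-entry threshold, which is the gap I am asking about.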

My second question would be: do we really want to create the RegisterSet even when we are in sparse mode? It defeats the goal of being memory efficient. It currently does not matter for me, since my bottleneck is the serialized size, but I find this quite surprising.
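To illustrate what I have in mind, here is a minimal sketch of lazy allocation. It is a hypothetical toy class, not the library's HyperLogLogPlus: the class name, the method names, the use of a TreeSet for sparse entries, and the one-int-per-register array are all assumptions chosen to keep the example short. The only point is that the register array stays null until the sparse set outgrows its threshold.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy estimator; it only demonstrates deferring register allocation.
class LazySketch {
    private final int p;
    private final int threshold;
    private Set<Integer> sparse = new TreeSet<>(); // sparse-mode entries (raw hashes)
    private int[] registers;                       // stays null while in sparse mode

    LazySketch(int p) {
        this.p = p;
        this.threshold = (int) ((1 << p) * 0.75);  // assumption: 75% of 2^p
    }

    boolean isSparse() {
        return registers == null;
    }

    // Takes an already-hashed 32-bit value to keep the sketch self-contained.
    void offerHashed(int hash) {
        if (registers == null) {
            sparse.add(hash);
            if (sparse.size() > threshold) {
                convertToNormal();
            }
        } else {
            update(hash);
        }
    }

    // Registers are allocated here, on the first conversion, not in the constructor.
    private void convertToNormal() {
        registers = new int[1 << p];
        for (int h : sparse) {
            update(h);
        }
        sparse = null;
    }

    // Standard HLL update: the top p bits pick the register, the rest give the rank.
    private void update(int hash) {
        int idx = hash >>> (32 - p);
        int rank = Integer.numberOfLeadingZeros((hash << p) | (1 << (p - 1))) + 1;
        registers[idx] = Math.max(registers[idx], rank);
    }
}
```

With p = 4 the threshold is 12, so the sketch stays sparse (and allocates no registers) for the first 12 distinct hashes and converts on the 13th.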
