Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[common] Introduce BitmapFileIndexMetaV2. #5028

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

hang8929201
Copy link
Contributor

Purpose

When the bitmap-indexed column cardinality is high, using the first version of the bitmap index format will take a lot of time to read the entire dictionary. But in fact we don't need a full dictionary when dealing with a small number of predicates, the performance of predicate hits on the bitmap can be improved by creating a secondary index on the dictionary.

https://docs.google.com/document/d/11dJlGlSX3dwYKKrPN0DQ2XQTsx6d9wI6DTBIiiBwomM/edit?tab=t.0

performance

cardinality 1000:
image

cardinality 10000:
image

cardinality 30000:
image

cardinality 50000:
image

cardinality 80000:
image

cardinality 100000:
image

Tests

org.apache.paimon.fileindex.bitmapindex.TestBitmapFileIndex
org.apache.paimon.spark.SparkFileIndexITCase
org.apache.paimon.benchmark.bitmap.BitmapIndexBenchmark

API and Format

Documentation

docs/content/concepts/spec/fileindex.md

return block.getLength(bitmapId);
}

private BitmapIndexBlock findBlock(Object bitmapId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can also using binarySearch too.

    private BitmapIndexBlock findBlock(Object bitmapId) {
        Comparator<Object> comparator = getComparator(dataType);

        // binary search
        int low = 0;
        int high = indexBlocks.size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            Object midVal = indexBlocks.get(mid).key;
            int cmp = comparator.compare(midVal, bitmapId);

            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return indexBlocks.get(mid);
            }
        }

        return low > 0 ? indexBlocks.get(low - 1) : null;
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, It seems to be have NPE problem if findBlock return null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, It seems to be have NPE problem if findBlock return null.

Ignore it, but the findBlock will be called three times if bitmapId exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -235,8 +271,18 @@ private RoaringBitmap32 readBitmap(Object bitmapId) {
if (offset < 0) {
return RoaringBitmap32.bitmapOf(-1 - offset);
} else {
seekableInputStream.seek(bodyStart + offset);
seekableInputStream.seek(bitmapFileIndexMeta.getBodyStart() + offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will call findBlocks in three times (contains, getOffset, getLength) if match in V2 format. Maybe we can avoid this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem is also exists in the BitmapIndexBlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata information of blocks is in memory, and binary search is used, so the performance is not bad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me feels obsessive-compulsive. So I want to suggest one more time.

For V2: find the blocks and then find the entry, only 2 binary search, but now need to binary search six times.

What do you think about this? Using the findEntry instead of contains/getOffset/getLength.

private RoaringBitmap32 readBitmap(Object bitmapId) {
    Entry entry = bitmapFileIndexMeta.findEntry(bitmapId);
    if (entry == null) {
        return new RoaringBitmap32();
    }

    int offset = entry.getOffset();
    int length = entry.getLength();
    if (offset < 0) {
        return RoaringBitmap32.bitmapOf(-1 - offset);
    }

    try {
        seekableInputStream.seek(bitmapFileIndexMeta.getBodyStart() + offset);
        RoaringBitmap32 bitmap = new RoaringBitmap32();
        if (enableNextOffsetToSize && length != -1) {
            DataInputStream input = new DataInputStream(seekableInputStream);
            byte[] bytes = new byte[length];
            input.readFully(bytes);
            bitmap.deserialize(ByteBuffer.wrap(bytes));
        } else {
            bitmap.deserialize(new DataInputStream(seekableInputStream));
        }
        return bitmap;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepted, thanks to JiaLiang for checking the code so carefully.

Copy link
Contributor

@Tan-JiaLiang Tan-JiaLiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hang8929201 Thanks for your patience!

+1 cc @JingsongLi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants