-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[common] Introduce BitmapFileIndexMetaV2. #5028
base: master
Are you sure you want to change the base?
Conversation
return block.getLength(bitmapId); | ||
} | ||
|
||
private BitmapIndexBlock findBlock(Object bitmapId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can also using binarySearch
too.
private BitmapIndexBlock findBlock(Object bitmapId) {
Comparator<Object> comparator = getComparator(dataType);
// binary search
int low = 0;
int high = indexBlocks.size() - 1;
while (low <= high) {
int mid = (low + high) >>> 1;
Object midVal = indexBlocks.get(mid).key;
int cmp = comparator.compare(midVal, bitmapId);
if (cmp < 0) {
low = mid + 1;
} else if (cmp > 0) {
high = mid - 1;
} else {
return indexBlocks.get(mid);
}
}
return low > 0 ? indexBlocks.get(low - 1) : null;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides, It seems to be have NPE problem if findBlock
return null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides, It seems to be have NPE problem if findBlock return null.
Ignore it, but the findBlock
will be called three times if bitmapId exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -235,8 +271,18 @@ private RoaringBitmap32 readBitmap(Object bitmapId) { | |||
if (offset < 0) { | |||
return RoaringBitmap32.bitmapOf(-1 - offset); | |||
} else { | |||
seekableInputStream.seek(bodyStart + offset); | |||
seekableInputStream.seek(bitmapFileIndexMeta.getBodyStart() + offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will call findBlocks
in three times (contains
, getOffset
, getLength
) if match in V2 format. Maybe we can avoid this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This problem is also exists in the BitmapIndexBlock
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata information of blocks is in memory, and binary search is used, so the performance is not bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes me feels obsessive-compulsive. So I want to suggest one more time.
For V2: find the blocks and then find the entry, only 2 binary search, but now need to binary search six times.
What do you think about this? Using the findEntry
instead of contains
/getOffset
/getLength
.
private RoaringBitmap32 readBitmap(Object bitmapId) {
Entry entry = bitmapFileIndexMeta.findEntry(bitmapId);
if (entry == null) {
return new RoaringBitmap32();
}
int offset = entry.getOffset();
int length = entry.getLength();
if (offset < 0) {
return RoaringBitmap32.bitmapOf(-1 - offset);
}
try {
seekableInputStream.seek(bitmapFileIndexMeta.getBodyStart() + offset);
RoaringBitmap32 bitmap = new RoaringBitmap32();
if (enableNextOffsetToSize && length != -1) {
DataInputStream input = new DataInputStream(seekableInputStream);
byte[] bytes = new byte[length];
input.readFully(bytes);
bitmap.deserialize(ByteBuffer.wrap(bytes));
} else {
bitmap.deserialize(new DataInputStream(seekableInputStream));
}
return bitmap;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accepted, thanks to JiaLiang for checking the code so carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hang8929201 Thanks for your patience!
+1 cc @JingsongLi
Purpose
When the bitmap-indexed column cardinality is high, using the first version of the bitmap index format will take a lot of time to read the entire dictionary. But in fact we don't need a full dictionary when dealing with a small number of predicates, the performance of predicate hits on the bitmap can be improved by creating a secondary index on the dictionary.
https://docs.google.com/document/d/11dJlGlSX3dwYKKrPN0DQ2XQTsx6d9wI6DTBIiiBwomM/edit?tab=t.0
performance
cardinality 1000:
cardinality 10000:
cardinality 30000:
cardinality 50000:
cardinality 80000:
cardinality 100000:
Tests
org.apache.paimon.fileindex.bitmapindex.TestBitmapFileIndex
org.apache.paimon.spark.SparkFileIndexITCase
org.apache.paimon.benchmark.bitmap.BitmapIndexBenchmark
API and Format
Documentation
docs/content/concepts/spec/fileindex.md