Use bloom filter for evaluating dynamic filters on strings #24528

Open
raunaqmorarka wants to merge 1 commit into master from df-bloom

Conversation

raunaqmorarka
Member

@raunaqmorarka raunaqmorarka commented Dec 19, 2024

Description

BenchmarkDynamicPageFilter.filterPages
    (filterSize)  (inputDataSet)  (inputNullChance)  (nonNullsSelectivity)  (nullsAllowed)   Mode  Cnt     Before Score      After Score  Units
               2  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   80.908 ± 1.927  172.244 ± 1.067  ops/s
               5  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   81.052 ± 2.569  175.619 ± 1.225  ops/s
              10  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   76.787 ± 1.561  176.371 ± 0.559  ops/s
             100  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   75.631 ± 1.372  174.288 ± 1.024  ops/s
            1000  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   69.615 ± 0.721  173.340 ± 0.867  ops/s
           10000  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   75.401 ± 1.233  173.285 ± 1.752  ops/s
          100000  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   64.335 ± 2.936  170.087 ± 1.370  ops/s
         1000000  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   16.808 ± 3.205  170.403 ± 1.471  ops/s
         5000000  VARCHAR_RANDOM               0.01                    0.2           false  thrpt   10   15.766 ± 0.820  150.588 ± 4.034  ops/s

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Improve performance of selective joins on strings. ({issue}`24528`)

bloom = new long[bloomSize];
bloomSizeMask = bloomSize - 1;
for (Slice value : values) {
    long hashCode = XxHash64.hash(value);
Contributor

Slice has a hashCode that already uses XxHash64 (and is memoized). Just use value.hashCode().

Member Author

These Slices are temporary objects created from a single contiguous Block; depending on the Type, the Slice may be subject to truncation and padding as well.
So I don't think we gain anything from the memoized hash code.
On the other hand, the hashing logic for the bloom filter could evolve to be different from Slice's hashCode implementation.

@findinpath
Contributor

Could you please add a high-level description of where the optimizations proposed in this PR would apply?
I'm particularly interested in a SQL sketch where you've observed, or foresee, that the engine will perform better.

@raunaqmorarka raunaqmorarka force-pushed the df-bloom branch 4 times, most recently from 13b8ccd to d8b44ff on December 20, 2024 at 07:13
Member

@lukasz-stec lukasz-stec left a comment

Looks great generally

for (Slice value : values) {
    long hashCode = XxHash64.hash(value);
    // Set 3 bits in a 64 bit word
    bloom[bloomIndex(hashCode)] |= bloomMask(hashCode);
Member

Did you consider using an open hash table of xxhash codes instead of the bloom filter? This could trade some performance for more accuracy.

Member Author

I want to use this eventually for collecting and evaluating dynamic filters with millions of distinct values, so I want the trade-offs to be in favor of using less memory and CPU.
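
For readers following along, here is a minimal sketch of the blocked bloom filter idea discussed in this thread: each key sets three bits within a single 64-bit word selected by its hash. The class name, sizing heuristic, and the bodies of bloomIndex/bloomMask below are illustrative assumptions based on the snippet above, not the PR's exact implementation.

// Sketch only: assumed names and sizing, not the code in this PR.
import io.airlift.slice.Slice;
import io.airlift.slice.XxHash64;
import java.util.List;

final class SliceBloomFilterSketch
{
    private final long[] bloom;
    private final int bloomSizeMask;

    SliceBloomFilterSketch(List<Slice> values)
    {
        // Power-of-two word count so the word index can be derived with a bit mask;
        // one 64-bit word per value is a rough, generous assumption.
        int bloomSize = Integer.highestOneBit(Math.max(values.size(), 1));
        if (bloomSize < values.size()) {
            bloomSize <<= 1;
        }
        bloom = new long[bloomSize];
        bloomSizeMask = bloomSize - 1;
        for (Slice value : values) {
            long hashCode = XxHash64.hash(value);
            // Set 3 bits in a 64-bit word
            bloom[bloomIndex(hashCode)] |= bloomMask(hashCode);
        }
    }

    boolean mightContain(Slice value)
    {
        long hashCode = XxHash64.hash(value);
        long mask = bloomMask(hashCode);
        // All 3 bits must be set; false positives are possible, false negatives are not.
        return (bloom[bloomIndex(hashCode)] & mask) == mask;
    }

    private int bloomIndex(long hashCode)
    {
        // Upper hash bits select the word.
        return ((int) (hashCode >>> 32)) & bloomSizeMask;
    }

    private static long bloomMask(long hashCode)
    {
        // Three bit positions taken from different 6-bit slices of the lower hash bits.
        return (1L << (hashCode & 63))
                | (1L << ((hashCode >>> 6) & 63))
                | (1L << ((hashCode >>> 12) & 63));
    }
}

Compared to a hash set of the values, a membership probe here is one hash, one array read, and one mask comparison, at the cost of occasional false positives.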

List<Supplier<FilterEvaluator>> subExpressionEvaluators = currentPredicate.getDomains().orElseThrow()
        .entrySet().stream()
        .map(entry -> {
            if (canUseBloomFilter(entry.getValue())) {
Member

Just an idea: we could potentially use a less accurate, size-limited bloom filter for a dynamic filter that has too many values for a normal filter, if the accuracy trade-off is worth it.

@@ -0,0 +1,181 @@
/*
Member

If I understand correctly, we replace the current implementation that uses ObjectOpenCustomHashSet with the bloom filter. That trades filter accuracy for performance. Could you make that explicit in the commit message?
Do you have an estimate of this bloom filter's accuracy? It looks pretty good given the size of the filter, i.e., only collisions matter.

Member Author

In io.trino.sql.gen.TestDynamicPageFilter#testSliceBloomFilter there is an assertion which checks that the observed selectivity for a filter with 0.1 true selectivity is between 0.1 and 0.115. It's a bit less accurate than the more canonical bloom filter implementations in ORC and Parquet, but it's significantly faster.
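
As a rough, textbook sanity check (not an analysis of this exact implementation): a classic bloom filter with k hash bits per key, m total bits, and n keys has a false-positive rate of roughly

p ≈ (1 - e^{-kn/m})^k

With k = 3 and, for example, 64 bits per key, that works out to about (1 - e^{-3/64})^3 ≈ 9.6 × 10^{-5}. A single-word blocked filter gives up some of that ideal accuracy in exchange for a single cache-line probe per lookup, which is consistent with the comment above: less accurate than the canonical ORC and Parquet implementations, but faster.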

Member Author

Updated the commit message.

@raunaqmorarka raunaqmorarka force-pushed the df-bloom branch 3 times, most recently from ffdebcc to b377b55 on December 31, 2024 at 13:10
@wendigo
Contributor

wendigo commented Dec 31, 2024

@raunaqmorarka did you run benchmarks? (TPCH)

@raunaqmorarka
Member Author

raunaqmorarka commented Dec 31, 2024

@raunaqmorarka did you run benchmarks? (TPCH)

Joins in TPC benchmarks are mostly on bigints, so this doesn't matter there; I'll run something manually for that.

Improves efficiency of evaluating dynamic filters on strings, with
the potential for some false positives compared to the existing approach.

4 participants