Optimize IN performance with specialized implementations #19390
geoffreyclaude wants to merge 11 commits
run benchmark in_list
🤖: Benchmark completed
run benchmarks
run benchmark tpch tpchds
🤖 Hi @Dandandan, thanks for the request (#19390 (comment)). Please choose one or more of these with
🤖: Benchmark completed
run benchmark tpch tpcds
🤖: Benchmark completed
🤖: Benchmark completed
@Dandandan I think once this optim is done, there could be a lot to reuse for broadcast joins...
For plain (non-dynamic) filters, I think based on a threshold (<= 3) it either gets planned as a chain of OR expressions or using
Force-pushed from 7ba1c85 to 276a37f.
run benchmark in_list
Force-pushed from 276a37f to d18b346.
🤖: Benchmark completed
Force-pushed from 2fc00e5 to 3db393a.
run benchmark in_list
🤖: Benchmark completed
Those are some pretty sweet performance results. I will try and find some time to review this more carefully.
Force-pushed from 9ea2d75 to 8cc3e0d.
Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
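As a rough illustration of the reinterpretation idea, the sketch below views an `i16` slice as `u16` without copying and then tests membership purely on bit patterns. The function names `as_u16` and `filter_i16` are hypothetical, and the linear `contains` scan is only a stand-in for whatever unsigned filter the PR actually uses; the point is that equal values have equal bit patterns, so a filter built for the unsigned type gives the same answers.

```rust
/// Views a slice of `i16` as `u16` without copying any data.
pub fn as_u16(values: &[i16]) -> &[u16] {
    // SAFETY: `i16` and `u16` have the same size and alignment, and every
    // bit pattern is a valid value of both types.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u16>(), values.len()) }
}

/// Membership test for signed values, done entirely in the unsigned domain.
/// The linear scan stands in for any filter that only understands `u16`.
pub fn filter_i16(haystack: &[i16], list: &[i16]) -> Vec<bool> {
    let needles = as_u16(list);
    as_u16(haystack)
        .iter()
        .map(|bits| needles.contains(bits))
        .collect()
}
```

Because only the bit pattern matters for equality, `-1` is looked up as `65535` and still matches exactly the rows that hold `-1`.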
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
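A minimal sketch of what a const-generic comparison chain might look like, assuming a hypothetical `SmallList` type over `i32` (not the PR's actual code). The list length `N` is a compile-time constant, which lets the compiler unroll the loop, and the non-short-circuiting `|=` avoids an early exit per element. Whether the generated machine code is truly branch-free depends on the compiler and optimization level.

```rust
/// A fixed-size list of constants whose length is known at compile time.
pub struct SmallList<const N: usize> {
    values: [i32; N],
}

impl<const N: usize> SmallList<N> {
    pub fn new(values: [i32; N]) -> Self {
        Self { values }
    }

    /// Compares the needle against every constant and ORs the results.
    /// There is no early return, so every call does the same work.
    #[inline]
    pub fn contains(&self, needle: i32) -> bool {
        let mut hit = false;
        for &v in self.values.iter() {
            hit |= v == needle;
        }
        hit
    }

    /// Applies the membership test to a whole column of values.
    pub fn filter(&self, column: &[i32]) -> Vec<bool> {
        column.iter().map(|&x| self.contains(x)).collect()
    }
}
```

For example, `SmallList::new([1, 3, 7]).filter(&[3, 4, 7])` marks the first and last rows as matches.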
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
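To make the open-addressing description concrete, here is a small sketch of a linear-probing set sized for a load factor of at most 25%. The type name `ProbeSet`, the `Option<u64>` slot representation, and the multiplicative hash constant are all assumptions made for illustration; the PR's table may differ in every one of those details.

```rust
/// Open-addressing set with linear probing; at most 25% of slots are used.
pub struct ProbeSet {
    slots: Vec<Option<u64>>,
    mask: usize,
}

impl ProbeSet {
    pub fn new(values: &[u64]) -> Self {
        // Power-of-two capacity of at least 4x the element count.
        let cap = (values.len().max(1) * 4).next_power_of_two();
        let mut set = Self { slots: vec![None; cap], mask: cap - 1 };
        for &v in values {
            set.insert(v);
        }
        set
    }

    /// Illustrative multiplicative hash; the high bits pick the start slot.
    fn slot(&self, v: u64) -> usize {
        (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize & self.mask
    }

    fn insert(&mut self, v: u64) {
        let mut i = self.slot(v);
        loop {
            match self.slots[i] {
                None => {
                    self.slots[i] = Some(v);
                    return;
                }
                Some(existing) if existing == v => return,
                // Occupied by another value: try the next slot, wrapping.
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }

    pub fn contains(&self, v: u64) -> bool {
        let mut i = self.slot(v);
        loop {
            match self.slots[i] {
                None => return false, // an empty slot ends the probe sequence
                Some(existing) if existing == v => return true,
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }
}
```

Keeping three quarters of the slots empty means most probes stop at the first or second slot, and the values sit inline in one array rather than behind a pointer.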
Introduces a two-stage filter for ByteView types. Stage 1 uses a fast DirectProbeFilter on masked views (len + prefix) for quick rejection; Stage 2 performs full verification only for potential long-string matches. Triggers for Utf8View and BinaryView.
Port of the two-stage View optimization to standard Utf8 and LargeUtf8 types. Encodes strings as i128 (len + prefix) for fast O(1) pre-filtering before falling back to full string comparison. Triggers for Utf8 and LargeUtf8.
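The two-stage idea in the two descriptions above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the 12-byte prefix width, the `u128` key layout (the description mentions `i128`), the `TwoStage` type, and the use of the standard `HashSet` in place of `DirectProbeFilter` are all chosen here for brevity.

```rust
use std::collections::HashSet;

/// Number of leading bytes packed into the key (assumed for this sketch).
const PREFIX: usize = 12;

/// Packs the length (4 bytes) and up to 12 leading bytes into one integer.
pub fn key(s: &[u8]) -> u128 {
    let mut buf = [0u8; 16];
    buf[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    let n = s.len().min(PREFIX);
    buf[4..4 + n].copy_from_slice(&s[..n]);
    u128::from_le_bytes(buf)
}

pub struct TwoStage {
    keys: HashSet<u128>,    // stage 1: length + prefix of every constant
    long: HashSet<Vec<u8>>, // stage 2: full bytes of constants longer than PREFIX
}

impl TwoStage {
    pub fn new(list: &[&str]) -> Self {
        let mut keys = HashSet::new();
        let mut long = HashSet::new();
        for s in list {
            let bytes = s.as_bytes();
            keys.insert(key(bytes));
            if bytes.len() > PREFIX {
                long.insert(bytes.to_vec());
            }
        }
        Self { keys, long }
    }

    pub fn contains(&self, s: &str) -> bool {
        let bytes = s.as_bytes();
        // Stage 1: a key miss rejects the row without touching the full string.
        if !self.keys.contains(&key(bytes)) {
            return false;
        }
        // Stage 2: short strings are fully described by their key; long
        // strings still need a full comparison.
        bytes.len() <= PREFIX || self.long.contains(bytes)
    }
}
```

With the list `["apple", "a-rather-long-string"]`, the probe `"a-rather-long-strinG"` passes stage 1 (same length and prefix) and is rejected only by the full comparison in stage 2.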
FixedSizeBinary(N) arrays share the same contiguous buffer layout as primitive arrays, so for power-of-2 widths (1, 2, 4, 8, 16) we can zero-copy reinterpret them and use the optimized primitive filters (bitmap, branchless, hash) instead of falling through to the NestedTypeFilter fallback.
Force-pushed from 8cc3e0d to 1448c71.
@adriangb @alamb In case you missed it, I've closed this PR and rewritten it as a stack of multiple PRs (one per optimization). The root issue #19241 links them all in order. Due to GitHub limitations, I can't do proper stacked PRs, so each PR has all the previous commits in it. But I'll rebase over main as we merge them in order, which will clean them up.
## Which issue does this PR close?

- Part of apache#19241.
- This PR was originally proposed as the first commit in the broader `IN LIST` optimization series in apache#19390.
- This PR builds on the refactor extracted in apache#21649.

## Rationale for this change

After apache#21649, non-primitive constant `IN LIST` evaluation still uses the extracted `ArrayStaticFilter` fallback path. That path relies on comparator checks for each input row.

This PR replaces that fallback lookup with a precomputed hash table and shared result construction so generic constant-list evaluation is cheaper before the later specialized primitive and string optimizations from apache#19390.

## What changes are included in this PR?

The PR is split so reviewers can separate mechanical cleanup from the behavior/performance changes:

1. `Refactor generic InList static filter helpers`

   Pure refactoring. This moves the existing generic static-filter construction and probe loop into helper methods inside `ArrayStaticFilter`, without changing the lookup data structure or result semantics.

2. `Build InList results from bitmaps`

   Changes how the generic path materializes `BooleanArray` results after membership has been computed. Instead of mixing membership checks and SQL three-valued null handling in the row loop, this builds a contains bitmap first and applies the null/negation rules with bitmap operations. This keeps the same `IN` / `NOT IN` semantics, including the `NULL` cases.

3. `Optimize generic InList static filtering`

   Replaces the fallback lookup storage from a unit-valued raw-entry `HashMap` to `hashbrown::HashTable<usize>`. The table still stores indices into the constant list and still uses Arrow hashing plus `make_comparator` for equality, but avoids the extra map value bookkeeping.

The existing specialized primitive filters and dictionary handling are intentionally left out of scope.

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No. This is an internal performance optimization only.

<!-- codex-benchmark-start -->
## Local benchmark snapshot

Benchmark command:

```bash
cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>
```

Method: compare adjacent saved baselines using raw Criterion sample minima (`min(time / iters)`). Lower is better; changes within +/-5% are treated as noise.

Compared baselines: merge-base -> [apache#21927](apache#21927)

Relevant scope: generic fallback string/view/binary rows.

Summary: 62 relevant rows, 61 faster, 0 slower, 1 within +/-5%.

Largest relevant deltas:

| Benchmark | Before | After | Change |
|---|---:|---:|---:|
| `utf8view/short_8b/list=64/match=0%` | 45.55 us | 21.85 us | -52.0% (2.08x faster) |
| `utf8view/short_8b/list=256/match=0%` | 44.34 us | 21.71 us | -51.0% (2.04x faster) |
| `utf8view/short_8b/list=16/match=0%` | 44.03 us | 21.60 us | -50.9% (2.04x faster) |
| `utf8view/short_8b/list=4/match=0%` | 41.54 us | 20.52 us | -50.6% (2.02x faster) |
| `utf8view/len_12b/list=16/match=0%` | 41.43 us | 20.55 us | -50.4% (2.02x faster) |
| `utf8view/len_12b/list=64/match=0%` | 41.59 us | 21.00 us | -49.5% (1.98x faster) |
| `fixed_size_binary/fsb16/list=10000/match=0%` | 58.11 us | 29.36 us | -49.5% (1.98x faster) |
| `fixed_size_binary/fsb16/list=256/match=0%` | 55.49 us | 28.57 us | -48.5% (1.94x faster) |
| `utf8view/shared_prefix/pfx=8/list=16/match=0%` | 57.54 us | 32.07 us | -44.3% (1.79x faster) |
| `utf8view/mixed_len/list=16/match=0%` | 62.86 us | 35.25 us | -43.9% (1.78x faster) |
| `fixed_size_binary/fsb16/list=4/match=0%` | 47.62 us | 27.20 us | -42.9% (1.75x faster) |
| `fixed_size_binary/fsb16/list=64/match=0%` | 47.85 us | 27.45 us | -42.6% (1.74x faster) |
| `utf8view/mixed_len/list=64/match=0%` | 66.09 us | 38.00 us | -42.5% (1.74x faster) |
| `utf8/short_8b/list=256/match=0%` | 52.09 us | 30.49 us | -41.5% (1.71x faster) |
| `utf8view/shared_prefix/pfx=12/list=32/match=0%` | 70.61 us | 42.33 us | -40.1% (1.67x faster) |

<details>
<summary>Full relevant table (62 rows)</summary>

| Benchmark | Before | After | Change |
|---|---:|---:|---:|
| `fixed_size_binary/fsb16/list=10000/match=0%` | 58.11 us | 29.36 us | -49.5% (1.98x faster) |
| `fixed_size_binary/fsb16/list=10000/match=50%` | 98.77 us | 81.20 us | -17.8% (1.22x faster) |
| `fixed_size_binary/fsb16/list=256/match=0%` | 55.49 us | 28.57 us | -48.5% (1.94x faster) |
| `fixed_size_binary/fsb16/list=256/match=50%` | 96.40 us | 79.32 us | -17.7% (1.22x faster) |
| `fixed_size_binary/fsb16/list=4/match=0%` | 47.62 us | 27.20 us | -42.9% (1.75x faster) |
| `fixed_size_binary/fsb16/list=4/match=50%` | 93.08 us | 75.58 us | -18.8% (1.23x faster) |
| `fixed_size_binary/fsb16/list=64/match=0%` | 47.85 us | 27.45 us | -42.6% (1.74x faster) |
| `fixed_size_binary/fsb16/list=64/match=50%` | 95.20 us | 74.96 us | -21.3% (1.27x faster) |
| `nulls/utf8/long_24b/list=16/match=50%/nulls=20%` | 85.74 us | 74.79 us | -12.8% (1.15x faster) |
| `nulls/utf8/short_8b/list=16/match=50%/nulls=20%` | 80.01 us | 77.30 us | -3.4% (within +/-5%) |
| `nulls/utf8view/long_24b/list=16/match=50%/nulls=20%` | 110.19 us | 96.52 us | -12.4% (1.14x faster) |
| `nulls/utf8view/short_8b/list=16/match=50%/nulls=20%` | 74.78 us | 62.92 us | -15.9% (1.19x faster) |
| `nulls/utf8view/short_8b/list=16/match=50%/nulls=20%/NOT_IN` | 71.24 us | 63.51 us | -10.9% (1.12x faster) |
| `nulls/utf8view/short_8b/list=16/match=50%/nulls=50%` | 83.84 us | 62.11 us | -25.9% (1.35x faster) |
| `utf8/long_24b/list=256/match=0%` | 58.79 us | 37.57 us | -36.1% (1.56x faster) |
| `utf8/long_24b/list=256/match=50%` | 107.85 us | 74.62 us | -30.8% (1.45x faster) |
| `utf8/long_24b/list=4/match=0%` | 56.68 us | 37.64 us | -33.6% (1.51x faster) |
| `utf8/long_24b/list=4/match=50%` | 100.40 us | 79.11 us | -21.2% (1.27x faster) |
| `utf8/long_24b/list=64/match=0%` | 59.39 us | 35.95 us | -39.5% (1.65x faster) |
| `utf8/long_24b/list=64/match=50%` | 101.26 us | 79.59 us | -21.4% (1.27x faster) |
| `utf8/mixed_len/list=16/match=0%` | 60.51 us | 49.06 us | -18.9% (1.23x faster) |
| `utf8/mixed_len/list=16/match=50%` | 154.00 us | 139.13 us | -9.7% (1.11x faster) |
| `utf8/mixed_len/list=64/match=0%` | 63.46 us | 49.87 us | -21.4% (1.27x faster) |
| `utf8/mixed_len/list=64/match=50%` | 154.01 us | 134.01 us | -13.0% (1.15x faster) |
| `utf8/shared_prefix/pfx=12/list=32/match=50%` | 98.73 us | 76.64 us | -22.4% (1.29x faster) |
| `utf8/short_8b/list=16/match=50%/NOT_IN` | 96.18 us | 72.15 us | -25.0% (1.33x faster) |
| `utf8/short_8b/list=256/match=0%` | 52.09 us | 30.49 us | -41.5% (1.71x faster) |
| `utf8/short_8b/list=256/match=50%` | 94.56 us | 74.39 us | -21.3% (1.27x faster) |
| `utf8/short_8b/list=4/match=0%` | 51.95 us | 32.27 us | -37.9% (1.61x faster) |
| `utf8/short_8b/list=4/match=50%` | 95.05 us | 78.47 us | -17.4% (1.21x faster) |
| `utf8/short_8b/list=64/match=0%` | 53.60 us | 33.34 us | -37.8% (1.61x faster) |
| `utf8/short_8b/list=64/match=50%` | 96.35 us | 80.95 us | -16.0% (1.19x faster) |
| `utf8view/len_12b/list=16/match=0%` | 41.43 us | 20.55 us | -50.4% (2.02x faster) |
| `utf8view/len_12b/list=16/match=50%` | 73.07 us | 50.49 us | -30.9% (1.45x faster) |
| `utf8view/len_12b/list=64/match=0%` | 41.59 us | 21.00 us | -49.5% (1.98x faster) |
| `utf8view/len_12b/list=64/match=50%` | 75.23 us | 50.25 us | -33.2% (1.50x faster) |
| `utf8view/long_24b/list=16/match=0%` | 58.48 us | 38.22 us | -34.7% (1.53x faster) |
| `utf8view/long_24b/list=16/match=50%` | 109.63 us | 87.32 us | -20.4% (1.26x faster) |
| `utf8view/long_24b/list=256/match=0%` | 61.12 us | 38.40 us | -37.2% (1.59x faster) |
| `utf8view/long_24b/list=256/match=50%` | 113.25 us | 91.61 us | -19.1% (1.24x faster) |
| `utf8view/long_24b/list=4/match=0%` | 58.43 us | 39.48 us | -32.4% (1.48x faster) |
| `utf8view/long_24b/list=4/match=50%` | 112.73 us | 90.14 us | -20.0% (1.25x faster) |
| `utf8view/long_24b/list=64/match=0%` | 62.17 us | 38.48 us | -38.1% (1.62x faster) |
| `utf8view/long_24b/list=64/match=50%` | 109.35 us | 87.64 us | -19.8% (1.25x faster) |
| `utf8view/mixed_len/list=16/match=0%` | 62.86 us | 35.25 us | -43.9% (1.78x faster) |
| `utf8view/mixed_len/list=16/match=50%` | 126.60 us | 103.97 us | -17.9% (1.22x faster) |
| `utf8view/mixed_len/list=64/match=0%` | 66.09 us | 38.00 us | -42.5% (1.74x faster) |
| `utf8view/mixed_len/list=64/match=50%` | 137.76 us | 112.23 us | -18.5% (1.23x faster) |
| `utf8view/shared_prefix/pfx=12/list=32/match=0%` | 70.61 us | 42.33 us | -40.1% (1.67x faster) |
| `utf8view/shared_prefix/pfx=12/list=32/match=50%` | 115.15 us | 94.27 us | -18.1% (1.22x faster) |
| `utf8view/shared_prefix/pfx=16/list=64/match=0%` | 63.47 us | 40.67 us | -35.9% (1.56x faster) |
| `utf8view/shared_prefix/pfx=16/list=64/match=50%` | 112.27 us | 91.32 us | -18.7% (1.23x faster) |
| `utf8view/shared_prefix/pfx=8/list=16/match=0%` | 57.54 us | 32.07 us | -44.3% (1.79x faster) |
| `utf8view/shared_prefix/pfx=8/list=16/match=50%` | 100.47 us | 82.69 us | -17.7% (1.21x faster) |
| `utf8view/short_8b/list=16/match=0%` | 44.03 us | 21.60 us | -50.9% (2.04x faster) |
| `utf8view/short_8b/list=16/match=50%` | 72.92 us | 49.10 us | -32.7% (1.49x faster) |
| `utf8view/short_8b/list=256/match=0%` | 44.34 us | 21.71 us | -51.0% (2.04x faster) |
| `utf8view/short_8b/list=256/match=50%` | 72.43 us | 51.58 us | -28.8% (1.40x faster) |
| `utf8view/short_8b/list=4/match=0%` | 41.54 us | 20.52 us | -50.6% (2.02x faster) |
| `utf8view/short_8b/list=4/match=50%` | 72.50 us | 48.46 us | -33.2% (1.50x faster) |
| `utf8view/short_8b/list=64/match=0%` | 45.55 us | 21.85 us | -52.0% (2.08x faster) |
| `utf8view/short_8b/list=64/match=50%` | 73.14 us | 50.92 us | -30.4% (1.44x faster) |

</details>
<!-- codex-benchmark-end -->
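The second change in the PR description above (`Build InList results from bitmaps`) separates membership from SQL null handling. A rough sketch of how such rules could be applied word-by-word is shown below. The `BoolBitmaps` type and the `finish_in_list` and `get` functions are hypothetical and use plain `u64` words rather than Arrow buffers. The rules encoded are standard SQL three-valued logic: a null input row yields `NULL`, and a non-matching row yields `NULL` when the list itself contains a `NULL`.

```rust
/// A boolean column as two bitmaps: the values and which rows are non-null.
pub struct BoolBitmaps {
    pub values: Vec<u64>,
    pub valid: Vec<u64>,
}

/// Turns a precomputed "contains" bitmap into an `IN` / `NOT IN` result.
/// `contains` bits for null input rows may hold anything; validity masks them.
pub fn finish_in_list(
    contains: &[u64],
    input_valid: &[u64],
    list_has_null: bool,
    negated: bool,
) -> BoolBitmaps {
    // NOT IN flips the value bits; validity is unaffected by negation.
    let values: Vec<u64> = contains
        .iter()
        .map(|&c| if negated { !c } else { c })
        .collect();
    // With a NULL in the list, only rows that matched have a definite answer.
    let valid: Vec<u64> = contains
        .iter()
        .zip(input_valid)
        .map(|(&c, &v)| if list_has_null { v & c } else { v })
        .collect();
    BoolBitmaps { values, valid }
}

/// Reads one row back as `Some(true)`, `Some(false)`, or `None` for NULL.
pub fn get(b: &BoolBitmaps, row: usize) -> Option<bool> {
    let (word, bit) = (row / 64, row % 64);
    if (b.valid[word] >> bit) & 1 == 0 {
        None
    } else {
        Some((b.values[word] >> bit) & 1 == 1)
    }
}
```

Each output word is produced from a handful of bitwise operations on 64 rows at a time, which is what makes it attractive to keep the null rules out of the per-row probe loop.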
Thank you very much for doing this in steps -- I will help move them along. I was finding it super challenging to find enough contiguous focus time for this PR. I am feeling very good about the stacked PR approach -- I'll try and help
## Which issue does this PR close?

- Part of apache#19241.
- Stacked on apache#21927.
- Next in stack: apache#23012.
- Extracted from apache#19390.

## Rationale for this change

`IN LIST` evaluates expressions like `x IN (1, 3, 7)`. The list on the right is fixed, so DataFusion can precompute a small lookup structure once and then reuse it for every input row.

For `UInt8`, there are only 256 possible values: 0 through 255. That means the lookup can be a tiny checklist with one bit per possible value:

- If the list contains `3`, set bit `3`.
- If the list contains `7`, set bit `7`.
- To check whether an input value is present, read that one bit.

So instead of hashing each input value or comparing it against the list, membership becomes one indexed bit test. The bitmap is only 32 bytes, because 256 bits = 32 bytes.

This PR adds the first specialized primitive path in the stack as a concrete `UInt8` filter. The `UInt16` version is added in apache#23012, and the shared bitmap abstraction is introduced only after both concrete implementations are visible in apache#23035.

## What changes are included in this PR?

- Adds `UInt8BitmapFilter`, a 32-byte bitmap built from the non-null constants in the `IN` list.
- Routes `UInt8` constant-list filtering to that bitmap path.
- Keeps the same SQL null behavior as the generic path for both `IN` and `NOT IN`.
- Moves shared dictionary-needle handling into `static_filter.rs`, so specialized filters can reuse it consistently.
- Adds focused tests for `UInt8` null handling and dictionary-encoded needles.

## Are these changes tested?

Yes.

- `cargo fmt --all`
- `cargo test -p datafusion-physical-expr bitmap_filter_u8 --lib`
- `cargo test -p datafusion-physical-expr in_list_int_types --lib`
- `cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings`

## Are there any user-facing changes?

No. This is an internal performance optimization only.
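The one-bit-per-value checklist described in the rationale can be sketched in a few lines. This is an illustration only: the type name `U8Bitmap` and its word layout (four `u64` words) are assumptions, and it is not the `UInt8BitmapFilter` added by the PR. Null constants are skipped when building, matching the description of a bitmap built from the non-null constants.

```rust
/// 256 bits, one per possible `u8` value: 4 words * 8 bytes = 32 bytes.
pub struct U8Bitmap {
    bits: [u64; 4],
}

impl U8Bitmap {
    /// Sets one bit for every non-null constant in the list.
    pub fn new(list: &[Option<u8>]) -> Self {
        let mut bits = [0u64; 4];
        for &v in list.iter().flatten() {
            // Upper 2 bits of the value pick the word, lower 6 pick the bit.
            bits[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { bits }
    }

    /// Membership is a single indexed bit test.
    pub fn contains(&self, v: u8) -> bool {
        (self.bits[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}
```

For `x IN (3, 7)`, bits 3 and 7 of the first word are set, and every input row is answered by one shift and one mask.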
<!-- codex-benchmark-start -->
## Local benchmark snapshot

Benchmark command:

```bash
cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>
```

Method: compare adjacent saved baselines using raw Criterion sample minima (`min(time / iters)`). Lower is better; changes within +/-5% are treated as noise.

These numbers were not rerun after splitting the bitmap abstraction into apache#23035.

Compared baselines: [apache#21927](apache#21927) -> [apache#23011](apache#23011)

Relevant scope: UInt8 narrow-integer rows.

Summary: 5 relevant rows, 5 faster, 0 slower, 0 within +/-5%.

| Benchmark | Before | After | Change |
|---|---:|---:|---:|
| `narrow_integer/u8/list=16/match=0%` | 20.39 us | 3.94 us | -80.7% (5.18x faster) |
| `narrow_integer/u8/list=16/match=50%` | 38.38 us | 3.98 us | -89.6% (9.65x faster) |
| `narrow_integer/u8/list=4/match=0%` | 18.18 us | 3.93 us | -78.4% (4.62x faster) |
| `narrow_integer/u8/list=4/match=50%` | 34.63 us | 3.96 us | -88.6% (8.75x faster) |
| `nulls/narrow_integer/u8/list=16/match=50%/nulls=20%` | 37.12 us | 4.16 us | -88.8% (8.93x faster) |

<!-- codex-benchmark-end -->

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close?

- Part of apache#19241.
- Stacked on apache#23011.
- Next in stack: apache#23035.
- Extracted from apache#19390.

## Rationale for this change

apache#23011 uses a bitmap checklist for `UInt8`, where there are 256 possible values. `UInt16` is the same idea with a larger value range: 0 through 65,535. That is still small enough to represent directly.

A `UInt16` bitmap needs one bit for each possible value:

- 65,536 possible values
- 65,536 bits total
- 8 KB of memory

Then a lookup is still simple: use the input value as the bit position and check whether that bit is set. For example, if the list contains `42`, bit `42` is set, and every input row with value `42` can be recognized with one bit test.

This PR keeps the scope narrow: it adds the unsigned 2-byte bitmap path as a concrete `UInt16` filter. apache#23035 then unifies the `UInt8` and `UInt16` implementations, and apache#23013 uses that shared shape for signed same-width reinterpretation.

## What changes are included in this PR?

- Adds `UInt16BitmapFilter`, backed by a heap-allocated 65,536-bit bitmap.
- Routes `UInt16` constant-list filtering to that bitmap path.
- Keeps the same `IN` / `NOT IN` null behavior as the generic path.
- Adds focused coverage for `UInt16` boundary values, nulls, and `NOT IN`.

## Are these changes tested?

Yes.

- `cargo fmt --all`
- `cargo test -p datafusion-physical-expr bitmap_filter_u16 --lib`
- `cargo test -p datafusion-physical-expr in_list_int_types --lib`
- `cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib`
- `cargo test -p datafusion-physical-expr test_in_list_dictionary_types --lib`
- `cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings`

## Are there any user-facing changes?

No. This is an internal performance optimization only.
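The size arithmetic in the rationale (65,536 bits, 8 KB) and the boundary values mentioned in the test coverage can be checked with a small sketch. The `U16Bitmap` type, its `Box<[u64; 1024]>` storage, and the `heap_bytes` helper are assumptions for illustration and are not the PR's `UInt16BitmapFilter`.

```rust
/// 65,536 bits / 64 bits per word = 1,024 words = 8,192 bytes.
const WORDS: usize = 65_536 / 64;

pub struct U16Bitmap {
    bits: Box<[u64; WORDS]>, // kept on the heap rather than inline
}

impl U16Bitmap {
    pub fn new(list: &[u16]) -> Self {
        let mut bits = Box::new([0u64; WORDS]);
        for &v in list {
            // Upper 10 bits select the word, lower 6 bits select the bit.
            bits[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { bits }
    }

    pub fn contains(&self, v: u16) -> bool {
        (self.bits[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }

    /// Size of the heap allocation backing the bitmap, in bytes.
    pub fn heap_bytes(&self) -> usize {
        std::mem::size_of_val(&*self.bits)
    }
}
```

The extreme values `0` and `65535` land in the first bit of the first word and the last bit of the last word, so they are the natural boundary cases to test.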
<!-- codex-benchmark-start -->
## Benchmark note

No local `in_list_strategy` numbers are included for this PR because the benchmark harness does not currently include a direct `UInt16` case. The available `i16` rows measure the signed reinterpretation path added in apache#23013 after the bitmap unification in apache#23035, not this PR's unsigned `UInt16` bitmap filter.

<!-- codex-benchmark-end -->
Status
This aggregate PR has been superseded by the split stacked series for #19241.
The review path is now:
Each PR now owns one focused step in the optimization stack, with its own explanation and CI signal. Closing this aggregate PR avoids duplicate review on the same work.
#19241 remains the umbrella issue for the overall `IN LIST` performance work.