POC: Sketch out cached filter result API #7513

Draft: wants to merge 16 commits into base: main

alamb
Contributor

@alamb alamb commented May 15, 2025

Draft until:

  • Pull out StringViewBuilder::concat_array into its own PR
  • Avoid double buffering of intermediate results
  • Add memory limit for results cache

Which issue does this PR close?

Rationale for this change

I am trying to sketch out enough of a cached filter result API to show performance improvements. Once I have done that, I will start proposing how to break it up into smaller PRs

What changes are included in this PR?

  1. Add code to cache columns which are reused in filter and scan
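
To illustrate the idea only (this is not the code in this PR), here is a minimal std-only sketch. It assumes that without caching, a column used by both the filter and the output is decoded twice. `decode_column`, `filter_then_scan`, and `filter_and_cache` are hypothetical names, and `Vec<i64>` stands in for a decoded Arrow array.

```rust
/// Hypothetical stand-in for an expensive Parquet column decode.
/// `decodes` counts how many times the column was decoded.
fn decode_column(raw: &[i64], decodes: &mut usize) -> Vec<i64> {
    *decodes += 1;
    raw.to_vec()
}

/// Without caching: decode once to evaluate the predicate, then decode the
/// same column again to produce the output rows.
fn filter_then_scan(raw: &[i64], pred: impl Fn(i64) -> bool, decodes: &mut usize) -> Vec<i64> {
    let mask: Vec<bool> = decode_column(raw, decodes).iter().map(|&v| pred(v)).collect();
    let values = decode_column(raw, decodes);
    values
        .into_iter()
        .zip(mask)
        .filter(|(_, keep)| *keep)
        .map(|(v, _)| v)
        .collect()
}

/// With caching: keep the filtered values produced while evaluating the
/// predicate, so the output needs no second decode.
fn filter_and_cache(raw: &[i64], pred: impl Fn(i64) -> bool, decodes: &mut usize) -> Vec<i64> {
    decode_column(raw, decodes)
        .into_iter()
        .filter(|&v| pred(v))
        .collect()
}
```

Both functions return the same rows; the cached variant touches the encoded data once, at the price of holding the filtered values in memory.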

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 15, 2025
@alamb alamb force-pushed the alamb/cache_filter_result branch from 78f96d1 to 31f2fa1 Compare May 15, 2025 19:39
@alamb alamb force-pushed the alamb/cache_filter_result branch from 31f2fa1 to 244e187 Compare May 15, 2025 20:33
filters: Vec<BooleanArray>,
}

impl CachedPredicateResultBuilder {
Contributor

Nice, this is very clear to get the cached result!

@alamb alamb force-pushed the alamb/cache_filter_result branch 2 times, most recently from 8961196 to 9e91e9f Compare May 16, 2025 12:48
/// TODO: potentially incrementally build the result of the predicate
/// evaluation without holding all the batches in memory. See
/// <https://github.com/apache/arrow-rs/issues/6692>
in_progress_arrays: Vec<Box<dyn InProgressArray>>,
Contributor

Thank you @alamb ,

Does this mean that in_progress_arrays is not the final result we use to generate the final batch?

For example:
Predicate a > 1 => in_progress_array_a is filtered by a > 1
Predicate b > 2 => in_progress_array_b is filtered by both a > 1 and b > 2, but in_progress_array_a is not updated

Contributor Author

This is an excellent question

What I was thinking is that CachedPredicateResult would manage the "currently" applied predicate

So in the case where there are multiple predicates, I was thinking of a method like CachedPredicateResult::merge, which could take the result of filtering a and apply the result of filtering by b

We can then put heuristics / logic for if/when we materialize the filters into CachedPredicateResult

But that is sort of speculation at this point -- I don't have it all worked out yet

My plan is to get far enough to show this structure works and can improve performance, and then I'll work on the trickier logic of applying multiple filters
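
A minimal sketch of what such a merge could look like, under stated assumptions: the struct below is a toy stand-in (the real type would hold Arrow arrays), the `merge` signature is hypothetical, and the later predicate is assumed to have been evaluated only on rows that are still selected, so its mask has one entry per cached row.

```rust
/// Toy stand-in for the cached result: the values of column `a` that passed
/// every predicate applied so far.
#[derive(Debug, PartialEq)]
struct CachedPredicateResult {
    a_values: Vec<i64>,
}

impl CachedPredicateResult {
    /// Hypothetical `merge`: apply the mask of a later predicate (for example
    /// `b > 2`) to the rows cached by an earlier one (for example `a > 1`).
    fn merge(self, mask: &[bool]) -> Self {
        assert_eq!(self.a_values.len(), mask.len());
        let a_values = self
            .a_values
            .into_iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| v)
            .collect();
        Self { a_values }
    }
}
```

With `a = [0, 2, 3, 5]`, the predicate `a > 1` caches `[2, 3, 5]`; if `b > 2` then yields the mask `[false, true, true]` on those rows, merging leaves `[3, 5]`.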

Contributor

CachedPredicateResult::merge method which could take the result of filtering a and apply the result of filtering by b

Great idea!

But that is sort of speculation at this point -- I don't have it all worked out yet

Sure, I will continue to review. Thank you @alamb!

@alamb alamb force-pushed the alamb/cache_filter_result branch from 9e91e9f to 147c7a7 Compare May 16, 2025 14:50
@alamb
Contributor Author

alamb commented May 16, 2025

I tested this branch using a query that filters and selects the same column (NOTE: it is critical NOT to use --all-features, as all features turns on force_validate):

cargo bench --features="arrow async" --bench arrow_reader_clickbench -- Q24

Here are the benchmark results (30ms --> 22ms, roughly 25% faster):

Gnuplot not found, using plotters backend
Looking for ClickBench files starting in current_dir and all parent directories: "/Users/andrewlamb/Software/arrow-rs/parquet"
arrow_reader_clickbench/sync/Q24
                        time:   [22.532 ms 22.604 ms 22.682 ms]
                        change: [-27.751% -27.245% -26.791%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

arrow_reader_clickbench/async/Q24
                        time:   [24.043 ms 24.171 ms 24.308 ms]
                        change: [-26.223% -25.697% -25.172%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

I realize this branch currently uses more memory (to buffer the filter results), but I think the additional memory growth can be limited with a setting.

@alamb alamb force-pushed the alamb/cache_filter_result branch from 147c7a7 to f1f7103 Compare May 16, 2025 15:08
@zhuqi-lucas
Contributor

Amazing result! I think this will be a better approach than a page cache: a page cache can have cache misses, but this PR always caches the result!

@alamb
Contributor Author

alamb commented May 16, 2025

Amazing result! I think this will be a better approach than a page cache: a page cache can have cache misses, but this PR always caches the result!

Thanks -- I think one potential problem is that the cached results may consume too much memory (but I will try and handle that shortly)

I think we should proceed with starting to merge some refactorings; I left some suggestions here:

@zhuqi-lucas
Contributor

It makes sense! Thank you @alamb.

@alamb
Contributor Author

alamb commented May 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (f1f7103) to 1a5999a diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@alamb
Contributor Author

alamb commented May 16, 2025

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00      2.0±0.03ms        ? ?/sec    1.15      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     12.9±0.06ms        ? ?/sec    1.08     13.9±0.11ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     14.9±0.16ms        ? ?/sec    1.06     15.8±0.14ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     24.4±0.26ms        ? ?/sec    1.59     38.8±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     37.6±0.33ms        ? ?/sec    1.39     52.3±0.32ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     35.5±0.24ms        ? ?/sec    1.41     50.0±0.37ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.01      5.1±0.05ms        ? ?/sec    1.00      5.0±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    114.6±0.51ms        ? ?/sec    1.42    162.8±0.60ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    132.0±0.61ms        ? ?/sec    1.59    209.4±0.69ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    200.4±0.94ms        ? ?/sec    2.12    425.7±1.52ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   414.9±12.61ms        ? ?/sec    1.18   491.6±11.23ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     41.9±0.46ms        ? ?/sec    1.38     57.7±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    105.5±0.37ms        ? ?/sec    1.58    166.9±1.13ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    103.1±0.51ms        ? ?/sec    1.59    164.1±0.89ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.00     64.2±0.58ms        ? ?/sec    1.00     64.2±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.38    234.5±1.60ms        ? ?/sec    1.00    169.6±0.96ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.58    162.6±0.64ms        ? ?/sec    1.00    102.6±0.55ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.00     38.9±0.26ms        ? ?/sec    1.00     39.1±0.26ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.00     48.5±0.26ms        ? ?/sec    1.00     48.6±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     47.8±0.32ms        ? ?/sec    1.11     53.2±0.48ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.00     40.0±0.30ms        ? ?/sec    1.00     39.9±0.42ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.00     14.3±0.07ms        ? ?/sec    1.00     14.4±0.18ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00  1848.0±17.20µs        ? ?/sec    1.19      2.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     12.0±0.08ms        ? ?/sec    1.05     12.6±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     13.8±0.07ms        ? ?/sec    1.05     14.4±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     25.5±1.92ms        ? ?/sec    1.59     40.6±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     36.2±1.26ms        ? ?/sec    1.49     54.0±2.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     33.9±0.35ms        ? ?/sec    1.52     51.4±0.30ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.04      4.4±0.11ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00    138.2±1.41ms        ? ?/sec    1.29    178.9±1.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    135.0±1.00ms        ? ?/sec    1.75    236.5±1.61ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    198.4±4.34ms        ? ?/sec    2.47    490.2±2.72ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00    375.7±8.92ms        ? ?/sec    1.16   433.9±10.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     38.5±0.44ms        ? ?/sec    1.42     54.8±0.67ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.00     95.3±0.41ms        ? ?/sec    1.64    156.6±0.97ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.00     93.2±0.54ms        ? ?/sec    1.65    153.8±0.74ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.01     62.3±0.39ms        ? ?/sec    1.00     61.8±0.38ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     5.27   835.6±11.42ms        ? ?/sec    1.00    158.6±0.82ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     5.89    561.1±3.23ms        ? ?/sec    1.00     95.2±0.51ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.00     31.6±0.24ms        ? ?/sec    1.00     31.7±0.32ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.01     35.0±0.32ms        ? ?/sec    1.00     34.7±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     44.2±0.28ms        ? ?/sec    1.12     49.3±0.33ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     37.1±0.29ms        ? ?/sec    1.00     36.8±0.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.00     13.6±0.06ms        ? ?/sec    1.00     13.5±0.06ms        ? ?/sec
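
A note on reading the table: the number in front of each timing appears to be that run's time relative to the faster of the two branches, so 1.00 marks the faster side. The sketch below reproduces that arithmetic from the rounded timings shown; `relative_to_fastest` is a made-up helper, and the benchmark tool presumably works from unrounded means, so the last digit can differ.

```rust
/// Returns (branch_ratio, main_ratio), each time divided by the faster one.
fn relative_to_fastest(branch_ms: f64, main_ms: f64) -> (f64, f64) {
    let fastest = branch_ms.min(main_ms);
    (branch_ms / fastest, main_ms / fastest)
}
```

For async/Q12 (24.4 ms vs 38.8 ms) this gives 1.00 and about 1.59; for sync/Q36 (835.6 ms vs 158.6 ms) it gives about 5.27 and 1.00, matching the rows above.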

@zhuqi-lucas
Contributor

This looks like a regression for Q36/Q37.

@alamb
Contributor Author

alamb commented May 20, 2025

This looks like a regression for Q36/Q37.

Yes, I agree -- I will figure out why

@alamb alamb force-pushed the alamb/cache_filter_result branch from f1f7103 to a0e4b29 Compare May 20, 2025 17:12
@alamb
Contributor Author

alamb commented May 20, 2025

This looks like a regression for Q36/Q37.

Yes, I agree -- I will figure out why

I did some profiling:

samply record target/release/deps/arrow_reader_clickbench-aef15514767c9665 --bench arrow_reader_clickbench/sync/Q36

Basically, the issue is that calling slice() takes a nontrivial amount of the time for Q36.

[Screenshot: 2025-05-20 at 1:23:25 PM]

I added some printlns, and it seems we have about 181k rows in total that pass the filter, but the number of buffers is crazy: more than 542k (I think this is related to concat not compacting the ByteViewArray). Working on this...

ByteViewArray::slice offset=8192 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=16384 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=24576 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=32768 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=40960 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=49152 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=57344 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=65536 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=73728 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=81920 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=90112 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=98304 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=106496 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=114688 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=122880 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=131072 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=139264 length=8192, total_rows: 181198 buffer_count: 542225
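
One plausible reading of the log above, sketched with a std-only toy model rather than the real arrow-rs types: if a concatenated view array keeps every input buffer, and slicing clones the whole buffer list, then each slice costs time proportional to the buffer count even though only 8192 rows are selected. `ViewArray`, `concat`, and `compact` below are hypothetical, and the model ignores details such as short strings being stored inline in the view.

```rust
use std::rc::Rc;

/// Toy view array: each view is (buffer index, offset, length), and the array
/// holds a shared handle to every data buffer.
struct ViewArray {
    views: Vec<(usize, usize, usize)>,
    buffers: Vec<Rc<Vec<u8>>>,
}

impl ViewArray {
    /// Zero-copy slice: copies the selected views but clones the handle of
    /// every buffer, so the cost grows with `buffers.len()`.
    fn slice(&self, offset: usize, len: usize) -> ViewArray {
        ViewArray {
            views: self.views[offset..offset + len].to_vec(),
            buffers: self.buffers.clone(),
        }
    }

    /// Compaction: copy only the referenced bytes into one new buffer.
    fn compact(&self) -> ViewArray {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for &(buf, off, len) in &self.views {
            let start = data.len();
            data.extend_from_slice(&self.buffers[buf][off..off + len]);
            views.push((0, start, len));
        }
        ViewArray { views, buffers: vec![Rc::new(data)] }
    }

    /// Bytes of the value at row `i`.
    fn value(&self, i: usize) -> &[u8] {
        let (buf, off, len) = self.views[i];
        &self.buffers[buf][off..off + len]
    }
}

/// Naive concat: keep every input buffer and re-index the views.
fn concat(arrays: &[ViewArray]) -> ViewArray {
    let mut out = ViewArray { views: vec![], buffers: vec![] };
    for a in arrays {
        let base = out.buffers.len();
        out.views.extend(a.views.iter().map(|&(b, o, l)| (b + base, o, l)));
        out.buffers.extend(a.buffers.iter().cloned());
    }
    out
}
```

In this model, slicing a concatenation of three single-buffer arrays still carries three buffers, while the compacted copy carries one. Compaction trades a copy of the string bytes for cheap slices afterwards.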

@alamb
Contributor Author

alamb commented May 20, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (c0c3eb4) to 45bda04 diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@alamb
Contributor Author

alamb commented May 20, 2025

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00      2.0±0.01ms        ? ?/sec    1.16      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     14.2±0.16ms        ? ?/sec    1.03     14.7±0.13ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     16.1±0.14ms        ? ?/sec    1.03     16.5±0.18ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     27.4±0.33ms        ? ?/sec    1.39     38.0±0.30ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     39.9±0.33ms        ? ?/sec    1.29     51.6±0.41ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     38.3±0.34ms        ? ?/sec    1.30     49.7±0.31ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.01      5.2±0.07ms        ? ?/sec    1.00      5.1±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    114.5±0.73ms        ? ?/sec    1.38    158.5±0.67ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    131.5±0.79ms        ? ?/sec    1.68    220.4±1.03ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    234.3±8.23ms        ? ?/sec    2.07    486.1±2.04ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   440.6±13.11ms        ? ?/sec    1.11   489.2±17.69ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     45.0±0.37ms        ? ?/sec    1.29     58.1±0.59ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    119.0±0.58ms        ? ?/sec    1.36    161.5±0.80ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    115.4±0.73ms        ? ?/sec    1.39    160.0±0.95ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.01     65.7±0.48ms        ? ?/sec    1.00     64.8±0.60ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.00    129.4±0.83ms        ? ?/sec    1.29    167.2±0.84ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.00     99.2±0.68ms        ? ?/sec    1.00     98.9±0.53ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.01     39.9±0.27ms        ? ?/sec    1.00     39.5±0.30ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.01     49.4±0.40ms        ? ?/sec    1.00     49.0±0.38ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     49.1±0.66ms        ? ?/sec    1.09     53.5±0.43ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.00     41.2±0.47ms        ? ?/sec    1.00     41.0±0.40ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.00     14.7±0.18ms        ? ?/sec    1.00     14.6±0.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00   1843.8±9.23µs        ? ?/sec    1.20      2.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     13.0±0.07ms        ? ?/sec    1.03     13.3±0.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     14.8±0.11ms        ? ?/sec    1.02     15.2±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     32.4±0.50ms        ? ?/sec    1.25     40.6±0.31ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     44.3±0.42ms        ? ?/sec    1.21     53.7±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     42.9±0.51ms        ? ?/sec    1.22     52.3±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.02      4.4±0.02ms        ? ?/sec    1.00      4.3±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00   121.8±10.69ms        ? ?/sec    1.44    175.5±1.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    137.2±9.68ms        ? ?/sec    1.70    233.1±1.71ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    214.2±9.00ms        ? ?/sec    2.22    475.1±3.54ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00   383.2±15.35ms        ? ?/sec    1.16   442.7±15.50ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     41.7±0.48ms        ? ?/sec    1.31     54.5±0.58ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.13   172.6±10.81ms        ? ?/sec    1.00    152.3±1.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.06    158.6±6.71ms        ? ?/sec    1.00    150.2±0.76ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.03     64.3±0.70ms        ? ?/sec    1.00     62.5±0.48ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     1.00    119.8±0.89ms        ? ?/sec    1.31    157.5±0.88ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     1.01     93.6±0.71ms        ? ?/sec    1.00     92.3±0.40ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.02     32.3±0.25ms        ? ?/sec    1.00     31.7±0.21ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.02     35.1±0.39ms        ? ?/sec    1.00     34.3±0.26ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     45.5±0.48ms        ? ?/sec    1.11     50.5±0.37ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     38.2±0.32ms        ? ?/sec    1.00     37.9±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.01     13.7±0.07ms        ? ?/sec    1.00     13.6±0.06ms        ? ?/sec

@alamb
Contributor Author

alamb commented May 20, 2025

🤖: Benchmark completed

Well, that is looking quite a bit better :bowtie:

I am now working on a way to reduce buffering requirements (will require incremental concat'ing)

@zhuqi-lucas
Contributor

🤖: Benchmark completed

Well, that is looking quite a bit better :bowtie:

I am now working on a way to reduce buffering requirements (will require incremental concat'ing)

Amazing result @alamb, it looks pretty cool!

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 22, 2025
@alamb
Contributor Author

alamb commented May 28, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (76fcb56) to 0a4ffa5 diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@zhuqi-lucas
Contributor

Status update

Current State of this PR

  1. Caches the results of the most recent filter which is applied during parquet decode
  2. Contains an initial implementation of ArrayBuilderExtFilter and ArrayBuilderExtConcat, which permit incrementally building arrays without materializing the intermediate results (a prototype of the API from "Optimize take/filter/concat from multiple input arrays to a single large output array", #6692)
  3. Contains IncrementalRecordBatchBuilder that incrementally builds record batches from filtered results.

The use of the incremental builders saves at least one memory copy during filtering and reduces the buffering required (which might also increase speed). It will also reduce the number of times we have to rewrite StringView arrays, which will help.
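
A rough std-only sketch of the incremental idea, not the IncrementalRecordBatchBuilder in this PR: selected rows are appended straight into an in-progress output, and a batch is emitted once it reaches a target size, so no filtered intermediate array is built and later concatenated. `IncrementalBuilder` and its methods are hypothetical, and a single `Vec<i64>` column stands in for a record batch.

```rust
/// Toy incremental builder over a single i64 column.
struct IncrementalBuilder {
    target_rows: usize,
    in_progress: Vec<i64>,
    completed: Vec<Vec<i64>>,
}

impl IncrementalBuilder {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0);
        Self {
            target_rows,
            in_progress: Vec::with_capacity(target_rows),
            completed: Vec::new(),
        }
    }

    /// Append the rows of `values` selected by `mask` directly to the output
    /// being built, emitting a batch whenever `target_rows` is reached.
    fn append_filtered(&mut self, values: &[i64], mask: &[bool]) {
        assert_eq!(values.len(), mask.len());
        for (&v, &keep) in values.iter().zip(mask) {
            if keep {
                self.in_progress.push(v);
                if self.in_progress.len() == self.target_rows {
                    let fresh = Vec::with_capacity(self.target_rows);
                    let full = std::mem::replace(&mut self.in_progress, fresh);
                    self.completed.push(full);
                }
            }
        }
    }

    /// Flush any remaining rows and return all batches.
    fn finish(mut self) -> Vec<Vec<i64>> {
        if !self.in_progress.is_empty() {
            self.completed.push(std::mem::take(&mut self.in_progress));
        }
        self.completed
    }
}
```

Each selected value is copied once, into its final batch; a filter-then-concat pipeline would copy it into a filtered array first and again during concatenation.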

Next Steps

I next plan to:

  1. Run arrow-rs benchmarks to show it helping
  2. Do a POC in DataFusion using the IncrementalRecordBatchBuilder in FilterExec to see if it makes a difference there

If those tests look good, I will begin breaking this PR up into smaller pieces for review

Major items I know are needed:

  1. Memory limiting for cached results in the parquet reader
  2. Updating previous cached results with subsequent filters
  3. Benchmarks showing the effect of using incremental filtering / append compared to filter and concat
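
For the first item, one possible policy (purely an assumption, nothing in this thread settles the design) is a byte budget: once caching more rows would exceed it, the cache is dropped and the reader falls back to decoding the column again. `BoundedCache` and `try_extend` are hypothetical names.

```rust
/// Hypothetical bounded cache of filtered i64 values.
struct BoundedCache {
    limit_bytes: usize,
    /// `None` once the limit was exceeded and caching was abandoned.
    values: Option<Vec<i64>>,
}

impl BoundedCache {
    fn new(limit_bytes: usize) -> Self {
        Self { limit_bytes, values: Some(Vec::new()) }
    }

    /// Try to cache more rows. Returns false once caching is disabled, in
    /// which case the caller has to re-decode the column later.
    fn try_extend(&mut self, rows: &[i64]) -> bool {
        let cached_len = match &self.values {
            Some(v) => v.len(),
            None => return false,
        };
        let needed = (cached_len + rows.len()) * std::mem::size_of::<i64>();
        if needed > self.limit_bytes {
            self.values = None; // release the memory held so far
            return false;
        }
        if let Some(v) = self.values.as_mut() {
            v.extend_from_slice(rows);
        }
        true
    }
}
```

With a 32-byte limit there is room for four `i64` values: caching three rows succeeds, and trying to add two more disables the cache.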

Great work, thank you @alamb, I will study and review the code in detail tomorrow!

@alamb
Contributor Author

alamb commented May 28, 2025

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00  1994.2±19.90µs        ? ?/sec    1.18      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     13.6±0.12ms        ? ?/sec    1.08     14.7±0.05ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     15.4±0.11ms        ? ?/sec    1.08     16.6±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     25.5±0.29ms        ? ?/sec    1.53     39.1±0.23ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     38.1±0.40ms        ? ?/sec    1.38     52.5±0.49ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     36.3±0.24ms        ? ?/sec    1.39     50.5±0.34ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.02      5.0±0.05ms        ? ?/sec    1.00      4.9±0.02ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    108.4±0.62ms        ? ?/sec    1.49    161.6±0.62ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    124.3±0.64ms        ? ?/sec    1.68    209.3±1.12ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    202.2±0.97ms        ? ?/sec    2.41    486.6±2.50ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   433.0±12.51ms        ? ?/sec    1.14   493.3±10.74ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     42.1±0.36ms        ? ?/sec    1.36     57.4±0.55ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    107.6±0.50ms        ? ?/sec    1.53    164.5±0.93ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    107.2±0.48ms        ? ?/sec    1.52    162.6±1.03ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.00     64.0±0.44ms        ? ?/sec    1.01     64.5±0.32ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.00    118.9±0.68ms        ? ?/sec    1.43    170.2±2.12ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.00     92.7±0.56ms        ? ?/sec    1.10    102.3±0.53ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.00     38.8±0.41ms        ? ?/sec    1.00     38.7±0.26ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.00     48.0±0.42ms        ? ?/sec    1.00     48.1±0.44ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     48.3±0.37ms        ? ?/sec    1.08     52.3±0.33ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.02     40.0±0.26ms        ? ?/sec    1.00     39.4±0.23ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.01     14.3±0.07ms        ? ?/sec    1.00     14.1±0.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00   1804.8±8.57µs        ? ?/sec    1.22      2.2±0.00ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     12.4±0.07ms        ? ?/sec    1.09     13.5±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     14.2±0.06ms        ? ?/sec    1.08     15.4±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     24.4±0.73ms        ? ?/sec    1.69     41.3±0.40ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     35.4±0.37ms        ? ?/sec    1.53     54.3±0.40ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     34.3±0.33ms        ? ?/sec    1.54     52.7±0.31ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.02      4.3±0.01ms        ? ?/sec    1.00      4.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00    113.0±1.50ms        ? ?/sec    1.58    179.0±0.70ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    124.7±3.12ms        ? ?/sec    1.90    237.4±2.48ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    172.4±2.06ms        ? ?/sec    2.83    487.4±2.89ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00   356.1±11.22ms        ? ?/sec    1.23   439.1±14.62ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     39.5±0.56ms        ? ?/sec    1.39     54.9±0.54ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.00    100.1±4.19ms        ? ?/sec    1.56    155.9±0.87ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.00     96.3±0.48ms        ? ?/sec    1.59    152.9±1.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.00     61.8±0.45ms        ? ?/sec    1.02     62.9±0.37ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     1.00    108.0±0.96ms        ? ?/sec    1.48    159.3±1.17ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     1.00     87.6±0.70ms        ? ?/sec    1.08     95.0±0.41ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.00     31.3±0.27ms        ? ?/sec    1.01     31.5±0.45ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.00     33.6±0.23ms        ? ?/sec    1.03     34.6±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     44.7±0.45ms        ? ?/sec    1.09     48.8±0.23ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     36.8±0.22ms        ? ?/sec    1.00     36.5±0.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.02     13.5±0.08ms        ? ?/sec    1.00     13.3±0.04ms        ? ?/sec

@alamb
Contributor Author

alamb commented May 28, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (f2b2c1b) to 0a4ffa5 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@alamb
Contributor Author

alamb commented May 28, 2025

🤖: Benchmark completed

Details

group                                                                         alamb_cache_filter_result              main
-----                                                                         -------------------------              ----
filter context decimal128 (kept 1/2)                                          1.73     71.1±1.27µs        ? ?/sec    1.00     41.2±3.77µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.00     50.5±0.62µs        ? ?/sec    1.03     51.9±1.35µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.42    366.6±0.46ns        ? ?/sec    1.00    257.9±0.28ns        ? ?/sec
filter context f32 (kept 1/2)                                                 2.09    145.2±0.24µs        ? ?/sec    1.00     69.6±0.08µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.38     18.8±0.54µs        ? ?/sec    1.00     13.6±0.54µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.72    781.9±8.58ns        ? ?/sec    1.00    453.6±0.63ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.67     70.7±0.32µs        ? ?/sec    1.00     42.4±0.09µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.67     70.7±0.13µs        ? ?/sec    1.00     42.4±0.07µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.66     70.6±0.08µs        ? ?/sec    1.00     42.4±0.08µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.66     70.6±0.07µs        ? ?/sec    1.00     42.5±0.07µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.67     70.7±0.34µs        ? ?/sec    1.00     42.4±0.05µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.67     70.7±0.12µs        ? ?/sec    1.00     42.4±0.04µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.66     70.7±0.10µs        ? ?/sec    1.00     42.5±0.05µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.67     70.7±0.09µs        ? ?/sec    1.00     42.4±0.09µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.67     70.7±0.08µs        ? ?/sec    1.00     42.4±0.06µs        ? ?/sec
filter context i32 (kept 1/2)                                                 3.13     70.8±0.10µs        ? ?/sec    1.00     22.6±0.04µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.00      6.5±0.33µs        ? ?/sec    1.00      6.4±0.42µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.48    369.9±0.35ns        ? ?/sec    1.00    250.1±1.49ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         2.21    145.0±0.17µs        ? ?/sec    1.00     65.7±0.49µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.38     18.4±0.86µs        ? ?/sec    1.00     13.3±0.40µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.47    662.7±1.02ns        ? ?/sec    1.00    449.5±1.11ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   3.57    299.8±4.47µs        ? ?/sec    1.00     84.0±2.28µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            6.02    347.7±3.65µs        ? ?/sec    1.00     57.8±0.35µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                58.28    37.1±0.94µs        ? ?/sec    1.00    636.7±1.52ns        ? ?/sec
filter context short string view (kept 1/2)                                   2.37    202.6±6.85µs        ? ?/sec    1.00     85.4±6.08µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            2.48    144.3±2.58µs        ? ?/sec    1.00     58.2±1.10µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                73.63    34.2±0.36µs        ? ?/sec    1.00    464.1±0.57ns        ? ?/sec
filter context string (kept 1/2)                                              1.09   591.7±13.39µs        ? ?/sec    1.00   541.6±11.06µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   3.06     71.9±0.27µs        ? ?/sec    1.00     23.5±0.04µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.03      7.6±0.36µs        ? ?/sec    1.00      7.3±0.32µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.17    955.6±4.56ns        ? ?/sec    1.00    813.9±2.12ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           2.20    146.0±0.42µs        ? ?/sec    1.00     66.3±0.13µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.34     19.1±0.47µs        ? ?/sec    1.00     14.2±0.46µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.22   1262.2±6.05ns        ? ?/sec    1.00   1036.4±6.22ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.05    659.1±6.55µs        ? ?/sec    1.00   629.7±15.86µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.11   1163.7±6.77ns        ? ?/sec    1.00   1046.3±2.05ns        ? ?/sec
filter context u8 (kept 1/2)                                                  3.63     68.4±0.11µs        ? ?/sec    1.00     18.9±0.03µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.08  1982.0±12.73ns        ? ?/sec    1.00  1829.0±10.22ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.54    364.8±0.32ns        ? ?/sec    1.00    237.5±0.35ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          2.32    143.3±0.30µs        ? ?/sec    1.00     61.7±0.09µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.56     13.5±0.02µs        ? ?/sec    1.00      8.6±0.02µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.59    855.3±2.14ns        ? ?/sec    1.00    537.2±3.81ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.10    106.1±0.45µs        ? ?/sec    1.00     96.4±0.44µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.01     53.8±1.74µs        ? ?/sec    1.00     53.4±1.52µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.02      3.1±0.01µs        ? ?/sec    1.00      3.0±0.01µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.00    226.6±0.40µs        ? ?/sec    1.02    232.2±0.78µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.00    134.2±0.51µs        ? ?/sec    1.04    140.0±0.36µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.00     69.6±1.78µs        ? ?/sec    1.01     70.6±1.68µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.01      3.2±0.01µs        ? ?/sec    1.00      3.1±0.01µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.00    131.8±0.14µs        ? ?/sec    1.02    135.0±0.43µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.04     11.1±0.54µs        ? ?/sec    1.00     10.7±0.42µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      3.1±0.01µs        ? ?/sec    1.01      3.1±0.01µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.00    181.3±2.07µs        ? ?/sec    1.04    188.4±6.88µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.12    221.0±8.77µs        ? ?/sec    1.00    198.2±6.30µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.01      3.2±0.01µs        ? ?/sec    1.00      3.2±0.01µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.12    102.8±0.41µs        ? ?/sec    1.00     92.0±0.10µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.00      8.7±0.37µs        ? ?/sec    1.00      8.7±0.46µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.01      3.1±0.01µs        ? ?/sec    1.00      3.1±0.01µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.01     86.6±0.20µs        ? ?/sec    1.00     85.8±0.22µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.05      2.8±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.8±0.01µs        ? ?/sec    1.00      2.8±0.01µs        ? ?/sec
filter run array (kept 1/2)                                                   1.08    389.3±1.20µs        ? ?/sec    1.00    359.4±2.47µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.16    360.1±1.95µs        ? ?/sec    1.00    310.7±1.57µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.00    246.6±0.86µs        ? ?/sec    1.00    247.0±0.86µs        ? ?/sec
filter single record batch                                                    1.28    118.1±0.47µs        ? ?/sec    1.00     92.6±0.08µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.14    104.4±1.84µs        ? ?/sec    1.00     91.9±0.08µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.08      4.0±0.02µs        ? ?/sec    1.00      3.8±0.01µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.03      3.1±0.01µs        ? ?/sec    1.00      3.0±0.03µs        ? ?/sec

@alamb
Contributor Author

alamb commented May 28, 2025

🤖: Benchmark completed

Looks like the filter code needs some optimization. I will rerun the benchmark to see if I can reproduce it.
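
For anyone reading the tables in this thread, here is a minimal sketch of how the leading number in each column appears to be derived. It assumes the number is that side's mean time divided by the faster of the two means, so the faster side shows 1.00. The helper names `relative` and `to_ns` are made up for illustration and are not part of the benchmark script.

```rust
/// Relative cost of each side against the faster one (1.00 = faster side).
/// Hypothetical helper, assuming ratio = mean time / min(mean times).
fn relative(branch: f64, main: f64) -> (f64, f64) {
    let best = branch.min(main);
    (branch / best, main / best)
}

/// Convert a reported value plus unit suffix into nanoseconds so that
/// rows mixing ns, µs and ms can be compared. Returns None for an unknown unit.
fn to_ns(value: f64, unit: &str) -> Option<f64> {
    match unit {
        "ns" => Some(value),
        "µs" | "us" => Some(value * 1e3),
        "ms" => Some(value * 1e6),
        "s" => Some(value * 1e9),
        _ => None,
    }
}

fn main() {
    // Row "filter context decimal128 (kept 1/2)" from the first run:
    // 71.1µs on the branch, 41.2µs on main.
    let branch = to_ns(71.1, "µs").unwrap();
    let main_ns = to_ns(41.2, "µs").unwrap();
    let (b, m) = relative(branch, main_ns);
    println!("{b:.2} {m:.2}"); // prints "1.73 1.00"
}
```

The printed times are rounded, so recomputing a ratio from them can differ from the table in the last digit (for example 37.1µs / 636.7ns gives about 58.27, while the table shows 58.28).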

@alamb
Contributor Author

alamb commented May 28, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (f2b2c1b) to 0a4ffa5 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@alamb
Contributor Author

alamb commented May 28, 2025

🤖: Benchmark completed

Details

group                                                                         alamb_cache_filter_result              main
-----                                                                         -------------------------              ----
filter context decimal128 (kept 1/2)                                          1.78     86.0±1.28µs        ? ?/sec    1.00     48.4±9.28µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.04     50.8±1.24µs        ? ?/sec    1.00     49.1±1.02µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.56    401.4±0.64ns        ? ?/sec    1.00    257.0±0.33ns        ? ?/sec
filter context f32 (kept 1/2)                                                 2.18    151.3±0.24µs        ? ?/sec    1.00     69.5±0.16µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.38     18.6±0.41µs        ? ?/sec    1.00     13.5±0.53µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.61    901.7±8.87ns        ? ?/sec    1.00    558.8±1.31ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.67     70.7±0.43µs        ? ?/sec    1.00     42.4±0.03µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.67     70.7±0.09µs        ? ?/sec    1.00     42.4±0.06µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.66     70.6±0.11µs        ? ?/sec    1.00     42.4±0.19µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.67     70.6±0.08µs        ? ?/sec    1.00     42.4±0.07µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.67     70.7±0.12µs        ? ?/sec    1.00     42.4±0.08µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.67     70.7±0.09µs        ? ?/sec    1.00     42.4±0.11µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.66     70.7±0.10µs        ? ?/sec    1.00     42.5±0.04µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.67     70.6±0.08µs        ? ?/sec    1.00     42.4±0.06µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.66     70.6±0.09µs        ? ?/sec    1.00     42.4±0.11µs        ? ?/sec
filter context i32 (kept 1/2)                                                 3.43     77.6±0.24µs        ? ?/sec    1.00     22.6±0.12µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.04      6.6±0.34µs        ? ?/sec    1.00      6.4±0.41µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.59    399.9±0.57ns        ? ?/sec    1.00    251.3±0.37ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         2.31    151.4±0.23µs        ? ?/sec    1.00     65.6±0.13µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.36     18.5±0.51µs        ? ?/sec    1.00     13.6±0.49µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.57    691.1±0.96ns        ? ?/sec    1.00    440.3±0.73ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   3.43    299.4±3.49µs        ? ?/sec    1.00     87.2±3.85µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            6.08    352.2±5.58µs        ? ?/sec    1.00     57.9±1.43µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                55.30    35.7±1.34µs        ? ?/sec    1.00    646.2±0.92ns        ? ?/sec
filter context short string view (kept 1/2)                                   2.48    200.0±5.99µs        ? ?/sec    1.00     80.7±0.56µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            2.58    149.3±8.41µs        ? ?/sec    1.00     57.9±2.37µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                73.82    34.4±0.99µs        ? ?/sec    1.00    466.7±0.68ns        ? ?/sec
filter context string (kept 1/2)                                              1.09   586.0±13.67µs        ? ?/sec    1.00   537.4±11.90µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   3.09     72.1±0.25µs        ? ?/sec    1.00     23.4±0.06µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.01      7.6±0.57µs        ? ?/sec    1.00      7.6±0.40µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.19    976.8±7.43ns        ? ?/sec    1.00    819.2±1.14ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           2.20    145.8±0.28µs        ? ?/sec    1.00     66.2±0.11µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.34     19.2±0.47µs        ? ?/sec    1.00     14.3±0.29µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.25   1300.6±8.98ns        ? ?/sec    1.00   1044.0±4.94ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.00   648.8±10.65µs        ? ?/sec    1.04   672.0±17.07µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.11  1090.9±10.47ns        ? ?/sec    1.00    982.7±1.69ns        ? ?/sec
filter context u8 (kept 1/2)                                                  4.01     75.7±0.12µs        ? ?/sec    1.00     18.9±0.03µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.11      2.0±0.01µs        ? ?/sec    1.00  1837.7±12.42ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.61    390.2±0.45ns        ? ?/sec    1.00    241.6±0.55ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          2.41    148.5±0.24µs        ? ?/sec    1.00     61.7±0.13µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.53     13.6±0.09µs        ? ?/sec    1.00      8.8±0.03µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.84    977.7±1.50ns        ? ?/sec    1.00    532.0±2.82ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.08    104.2±0.31µs        ? ?/sec    1.00     96.3±0.20µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.03     54.4±1.14µs        ? ?/sec    1.00     52.8±1.93µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.02      3.1±0.00µs        ? ?/sec    1.00      3.0±0.01µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.00    226.4±0.34µs        ? ?/sec    1.02    232.1±0.46µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.00    134.7±0.44µs        ? ?/sec    1.04    140.3±0.42µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.01     72.8±2.00µs        ? ?/sec    1.00     71.8±2.01µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      3.2±0.01µs        ? ?/sec    1.00      3.2±0.01µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.00    131.6±0.22µs        ? ?/sec    1.03    135.2±0.18µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.00     10.8±0.55µs        ? ?/sec    1.04     11.2±0.46µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      3.1±0.01µs        ? ?/sec    1.00      3.1±0.01µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.01    183.5±6.15µs        ? ?/sec    1.00    182.2±7.87µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    205.8±6.49µs        ? ?/sec    1.10    225.9±5.31µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.03      3.2±0.01µs        ? ?/sec    1.00      3.1±0.02µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.12    102.7±0.19µs        ? ?/sec    1.00     92.0±0.14µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.07      9.1±0.35µs        ? ?/sec    1.00      8.6±0.47µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.00      3.1±0.01µs        ? ?/sec    1.00      3.1±0.01µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.01     86.7±0.23µs        ? ?/sec    1.00     85.7±0.12µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.04      2.8±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.8±0.00µs        ? ?/sec    1.00      2.8±0.01µs        ? ?/sec
filter run array (kept 1/2)                                                   1.09    389.6±0.82µs        ? ?/sec    1.00    358.0±0.78µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.16    359.0±1.13µs        ? ?/sec    1.00    310.2±1.14µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.00    247.6±7.21µs        ? ?/sec    1.00    246.5±0.78µs        ? ?/sec
filter single record batch                                                    1.27    117.9±0.20µs        ? ?/sec    1.00     92.7±0.28µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.15    105.4±1.01µs        ? ?/sec    1.00     91.9±0.16µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.09      4.1±0.02µs        ? ?/sec    1.00      3.7±0.02µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.04      3.1±0.01µs        ? ?/sec    1.00      3.0±0.00µs        ? ?/sec

@alamb
Contributor Author

alamb commented May 28, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (3374c03) to 0a4ffa5 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@alamb
Contributor Author

alamb commented May 28, 2025

🤖: Benchmark completed

Details

group                                                                         alamb_cache_filter_result              main
-----                                                                         -------------------------              ----
filter context decimal128 (kept 1/2)                                          1.71     72.0±2.25µs        ? ?/sec    1.00     42.1±3.49µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.00     51.0±1.64µs        ? ?/sec    1.00     50.9±1.11µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.33    352.4±0.28ns        ? ?/sec    1.00    265.9±0.27ns        ? ?/sec
filter context f32 (kept 1/2)                                                 2.07    144.5±0.19µs        ? ?/sec    1.00     69.8±0.19µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.29     17.3±0.46µs        ? ?/sec    1.00     13.4±0.38µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.39    772.2±1.18ns        ? ?/sec    1.00    556.1±0.59ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.67     70.6±0.08µs        ? ?/sec    1.00     42.4±0.12µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.67     70.6±0.09µs        ? ?/sec    1.00     42.4±0.06µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.66     70.7±0.14µs        ? ?/sec    1.00     42.4±0.07µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.66     70.7±0.07µs        ? ?/sec    1.00     42.5±0.05µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.67     70.7±0.11µs        ? ?/sec    1.00     42.4±0.07µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.67     70.7±0.12µs        ? ?/sec    1.00     42.4±0.08µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.66     70.7±0.08µs        ? ?/sec    1.00     42.5±0.07µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.67     70.6±0.11µs        ? ?/sec    1.00     42.4±0.06µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.67     70.7±0.14µs        ? ?/sec    1.00     42.4±0.04µs        ? ?/sec
filter context i32 (kept 1/2)                                                 3.12     71.8±0.35µs        ? ?/sec    1.00     23.0±0.53µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.06      6.7±0.47µs        ? ?/sec    1.00      6.3±0.27µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.36    352.3±0.43ns        ? ?/sec    1.00    258.5±0.72ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         2.21    145.0±0.42µs        ? ?/sec    1.00     65.5±0.10µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.26     17.4±0.43µs        ? ?/sec    1.00     13.8±0.57µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.50    673.4±0.55ns        ? ?/sec    1.00    449.1±0.51ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   7.35    679.9±3.50µs        ? ?/sec    1.00     92.5±7.66µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            5.62    334.1±4.83µs        ? ?/sec    1.00     59.4±2.48µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                57.87    37.5±1.24µs        ? ?/sec    1.00    647.8±0.66ns        ? ?/sec
filter context short string view (kept 1/2)                                   6.10    588.8±6.15µs        ? ?/sec    1.00     96.5±4.12µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            2.65    151.7±6.64µs        ? ?/sec    1.00     57.2±1.83µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                78.77    36.8±1.51µs        ? ?/sec    1.00    466.8±0.38ns        ? ?/sec
filter context string (kept 1/2)                                              1.04   574.1±10.97µs        ? ?/sec    1.00   550.3±14.13µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   3.36     78.5±0.26µs        ? ?/sec    1.00     23.4±0.06µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00      7.3±0.38µs        ? ?/sec    1.01      7.4±0.31µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.15    947.6±1.33ns        ? ?/sec    1.00    821.9±1.22ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           2.29    152.0±0.42µs        ? ?/sec    1.00     66.4±0.10µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.25     18.0±0.44µs        ? ?/sec    1.00     14.4±0.46µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.24   1275.0±2.93ns        ? ?/sec    1.00   1024.8±3.51ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.05   647.8±13.77µs        ? ?/sec    1.00   619.0±11.27µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.00    926.4±7.03ns        ? ?/sec    1.26   1163.1±5.12ns        ? ?/sec
filter context u8 (kept 1/2)                                                  3.72     70.3±0.25µs        ? ?/sec    1.00     18.9±0.03µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.07  1977.6±14.43ns        ? ?/sec    1.00  1849.5±10.23ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.41    346.6±0.51ns        ? ?/sec    1.00    245.6±0.45ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          2.32    143.3±0.31µs        ? ?/sec    1.00     61.7±0.10µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.41     12.4±0.03µs        ? ?/sec    1.00      8.8±0.02µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.64    896.2±1.60ns        ? ?/sec    1.00    546.0±7.43ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.06    102.7±0.36µs        ? ?/sec    1.00     97.0±0.85µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.00     52.5±1.59µs        ? ?/sec    1.03     53.8±1.71µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      2.4±0.00µs        ? ?/sec    1.26      3.0±0.02µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.00    222.8±0.47µs        ? ?/sec    1.04    232.1±0.52µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.00    133.4±0.44µs        ? ?/sec    1.05    139.9±0.37µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.03     71.5±3.47µs        ? ?/sec    1.00     69.5±1.98µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      2.5±0.01µs        ? ?/sec    1.25      3.2±0.00µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.00    129.7±0.19µs        ? ?/sec    1.04    134.8±0.17µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.02     11.5±0.51µs        ? ?/sec    1.00     11.3±0.71µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      2.4±0.00µs        ? ?/sec    1.26      3.1±0.01µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.03    193.8±8.66µs        ? ?/sec    1.00    187.3±9.88µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    205.7±6.17µs        ? ?/sec    1.02    209.1±5.13µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      2.5±0.01µs        ? ?/sec    1.26      3.2±0.01µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.10    101.4±0.33µs        ? ?/sec    1.00     92.1±0.16µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.08      9.0±0.43µs        ? ?/sec    1.00      8.3±0.30µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.00      2.4±0.01µs        ? ?/sec    1.27      3.1±0.01µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.00     81.7±0.18µs        ? ?/sec    1.05     85.7±0.19µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.17      3.1±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.1±0.01µs        ? ?/sec    1.31      2.8±0.00µs        ? ?/sec
filter run array (kept 1/2)                                                   1.25    449.6±2.05µs        ? ?/sec    1.00    358.6±0.75µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.32    410.5±1.29µs        ? ?/sec    1.00    310.8±1.29µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.27    314.5±0.77µs        ? ?/sec    1.00    247.7±0.80µs        ? ?/sec
filter single record batch                                                    1.26    116.3±0.16µs        ? ?/sec    1.00     92.6±0.08µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.10    100.9±0.25µs        ? ?/sec    1.00     92.0±0.13µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.11      4.1±0.01µs        ? ?/sec    1.00      3.7±0.02µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      2.5±0.01µs        ? ?/sec    1.21      3.0±0.01µs        ? ?/sec

@zhuqi-lucas
Contributor

zhuqi-lucas commented May 29, 2025

🤖: Benchmark completed

Looks like the filter code needs some optimization. I will rerun to see if I can reproduce it.

It seems the slowdown is reproducible, and it mostly happens only in the "filter context" benchmarks.

But the ClickBench benchmark results are great.

if let Some((null_count, nulls)) = filter_null_mask(array.nulls(), predicate) {
builder = builder.null_count(null_count).null_bit_buffer(Some(nulls));
}
let builder = PrimitiveBuilder::<T>::with_capacity(predicate.count);
Contributor

Is it necessary to remove/change those implementations?
It might be pretty hard to match the performance.

Contributor Author

No it is not necessary to change the original implementations -- I reworked the existing filter kernels in this PR so that I reuse the existing tests and have confidence in the correctness of the approach.

IterationStrategy::Indices(indices) => {
append_filtered_nulls(null_buffer_builder, array, predicate);
let iter = indices.iter().map(|x| values[*x]);
for v in iter {
Contributor

this will be slow

Contributor Author

Good call -- I can find a way to make it faster (direct access to the underlying buffer)

Contributor Author

In 15541d2
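
To illustrate the concern raised in this thread, here is a minimal sketch that uses plain `Vec<i32>` rather than the arrow builders. It is not the code from 15541d2; `gather_push` and `gather_extend` are hypothetical names. The first function appends one selected value at a time, the second reserves space once and writes all selected values in a single bulk call.

```rust
/// Per-element version: every `push` re-checks capacity and may reallocate.
fn gather_push(values: &[i32], indices: &[usize], out: &mut Vec<i32>) {
    for &i in indices {
        out.push(values[i]);
    }
}

/// Bulk version: reserve once, then extend from an iterator whose length is
/// known up front, so the destination buffer is sized a single time.
fn gather_extend(values: &[i32], indices: &[usize], out: &mut Vec<i32>) {
    out.reserve(indices.len());
    out.extend(indices.iter().map(|&i| values[i]));
}
```

Both functions produce the same output for the same inputs; only the way the destination buffer grows differs. Whether the bulk form is measurably faster in the real kernels would need to be confirmed with the `filter_kernels` benchmark.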

@alamb
Contributor Author

alamb commented May 29, 2025

My analysis of benchmark results is:

  1. Many of the filter kernel benchmarks are dominated by the time required to construct the final Array (as the actual filtering operation is so fast), so adding a Builder in that path slows them down. I am not sure how important this would be in an actual system, where filtering is almost always followed by coalesce.
  2. The reason StringView is so much slower is that I have it automatically coalescing values, which is important for actual systems but clearly a major overhead in the microbenchmarks

/// 3. `take`-n: a subset of the input array is selected based on the indices provided in a `UInt32Array` or similar.
///
/// This structure handles multiple arrays
pub struct IncrementalRecordBatchBuilder {
Contributor Author

I am particularly excited about this structure as I think it is exactly what we need in the DataFusion filter exec

Contributor

Great work, it is very nice to be able to append a filter and have it applied to the batch!
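
As a rough illustration of the "filter, then coalesce" pattern discussed in this thread, here is a sketch over plain `i32` values. It is not the `IncrementalRecordBatchBuilder` from this PR: the type `IncrementalBuilder`, its methods, and the `target_rows` parameter are all assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an incremental builder: rows that pass each
/// filter are appended to a buffer, and a batch is emitted once it is full.
struct IncrementalBuilder {
    target_rows: usize,
    buffer: Vec<i32>,
    completed: VecDeque<Vec<i32>>,
}

impl IncrementalBuilder {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be non-zero");
        Self {
            target_rows,
            buffer: Vec::with_capacity(target_rows),
            completed: VecDeque::new(),
        }
    }

    /// Append the rows of `values` whose `filter` entry is true.
    fn append_filtered(&mut self, values: &[i32], filter: &[bool]) {
        assert_eq!(values.len(), filter.len());
        for (&v, &keep) in values.iter().zip(filter) {
            if keep {
                self.buffer.push(v);
                if self.buffer.len() == self.target_rows {
                    // Swap in a fresh buffer and queue the full one.
                    let next = Vec::with_capacity(self.target_rows);
                    let full = std::mem::replace(&mut self.buffer, next);
                    self.completed.push_back(full);
                }
            }
        }
    }

    /// Take the oldest completed batch, if one is ready.
    fn next_batch(&mut self) -> Option<Vec<i32>> {
        self.completed.pop_front()
    }

    /// Flush the remaining rows as a final, possibly short, batch.
    fn finish(&mut self) -> Option<Vec<i32>> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buffer))
        }
    }
}
```

With `IncrementalBuilder::new(3)`, appending `[1, 2, 3, 4]` with filter `[true, false, true, false]` and then `[5, 6, 7, 8]` with filter `[true, true, false, true]` yields one full batch `[1, 3, 5]` from `next_batch()` and a remainder `[6, 8]` from `finish()`. The per-row `push` is used only for clarity; a real implementation would copy selected ranges in bulk.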

@alamb
Contributor Author

alamb commented May 29, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (0b3a52a) to 0a4ffa5 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

alamb added a commit to alamb/datafusion that referenced this pull request May 29, 2025
@alamb
Contributor Author

alamb commented May 29, 2025

🤖: Benchmark completed

Details

group                                                                         alamb_cache_filter_result              main
-----                                                                         -------------------------              ----
filter context decimal128 (kept 1/2)                                          1.00     53.2±5.12µs        ? ?/sec    1.06     56.6±4.76µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.00     49.5±0.59µs        ? ?/sec    1.02     50.2±1.17µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.22    314.9±1.11ns        ? ?/sec    1.00    258.5±0.32ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.95    135.6±0.18µs        ? ?/sec    1.00     69.6±0.15µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.37     18.8±0.42µs        ? ?/sec    1.00     13.7±0.45µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.43    648.3±0.72ns        ? ?/sec    1.00    453.5±3.36ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.68     71.1±0.11µs        ? ?/sec    1.00     42.4±0.06µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.67     71.0±0.07µs        ? ?/sec    1.00     42.4±0.07µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.67     70.9±0.18µs        ? ?/sec    1.00     42.4±0.07µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.67     70.9±0.08µs        ? ?/sec    1.00     42.4±0.08µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.67     71.0±0.26µs        ? ?/sec    1.00     42.4±0.05µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.67     71.0±0.27µs        ? ?/sec    1.00     42.4±0.09µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.67     71.0±0.10µs        ? ?/sec    1.00     42.5±0.08µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.67     70.9±0.09µs        ? ?/sec    1.00     42.4±0.05µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.67     70.9±0.10µs        ? ?/sec    1.00     42.4±0.12µs        ? ?/sec
filter context i32 (kept 1/2)                                                 2.34     53.0±0.08µs        ? ?/sec    1.00     22.6±0.07µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.05      6.7±0.33µs        ? ?/sec    1.00      6.4±0.48µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.38    340.7±1.18ns        ? ?/sec    1.00    247.7±0.34ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.94    127.1±0.52µs        ? ?/sec    1.00     65.5±0.17µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.39     18.6±0.42µs        ? ?/sec    1.00     13.4±0.43µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.16    630.0±1.18ns        ? ?/sec    1.00    542.8±1.17ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.34    126.4±3.32µs        ? ?/sec    1.00     94.6±5.57µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.08     63.1±1.29µs        ? ?/sec    1.00     58.2±1.18µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.00    592.4±0.86ns        ? ?/sec    1.09    648.4±1.33ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.42    126.1±3.74µs        ? ?/sec    1.00     88.8±7.69µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.05     62.4±1.17µs        ? ?/sec    1.00     59.4±0.35µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.12    522.6±0.85ns        ? ?/sec    1.00    466.0±0.57ns        ? ?/sec
filter context string (kept 1/2)                                              1.08   590.4±12.15µs        ? ?/sec    1.00    544.7±9.12µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   2.31     53.8±0.16µs        ? ?/sec    1.00     23.3±0.08µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00      7.2±0.35µs        ? ?/sec    1.04      7.5±0.45µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.13    919.6±6.67ns        ? ?/sec    1.00    816.7±2.12ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.92    127.9±0.22µs        ? ?/sec    1.00     66.4±0.11µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.38     19.4±0.52µs        ? ?/sec    1.00     14.0±0.45µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.20   1236.3±6.21ns        ? ?/sec    1.00   1028.5±2.61ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.00   651.1±14.66µs        ? ?/sec    1.02   666.2±10.86µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.00    912.3±6.12ns        ? ?/sec    1.14   1041.3±6.28ns        ? ?/sec
filter context u8 (kept 1/2)                                                  3.27     61.7±0.13µs        ? ?/sec    1.00     18.9±0.03µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.00   1984.7±5.73ns        ? ?/sec    1.04      2.1±0.01µs        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.48    352.8±0.34ns        ? ?/sec    1.00    238.1±1.08ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          2.19    135.4±0.25µs        ? ?/sec    1.00     61.7±0.09µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.61     13.8±0.03µs        ? ?/sec    1.00      8.6±0.03µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.69    744.8±1.58ns        ? ?/sec    1.00    441.6±4.65ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.16    111.5±0.40µs        ? ?/sec    1.00     96.5±0.99µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.00     52.6±1.37µs        ? ?/sec    1.03     54.3±1.19µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      2.5±0.01µs        ? ?/sec    1.20      3.0±0.03µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.00    207.6±0.55µs        ? ?/sec    1.12    232.0±0.43µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.00    132.6±1.43µs        ? ?/sec    1.06    140.1±0.48µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.00     69.1±0.65µs        ? ?/sec    1.06     73.4±1.39µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      2.5±0.01µs        ? ?/sec    1.27      3.2±0.01µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.00    129.6±0.42µs        ? ?/sec    1.04    134.8±0.23µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.04     11.3±0.63µs        ? ?/sec    1.00     10.9±0.54µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      2.5±0.01µs        ? ?/sec    1.23      3.1±0.01µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.00    184.3±7.92µs        ? ?/sec    1.01    185.8±7.65µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    214.4±6.28µs        ? ?/sec    1.00    215.1±5.03µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      2.5±0.00µs        ? ?/sec    1.26      3.1±0.00µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.18    108.6±0.17µs        ? ?/sec    1.00     92.0±0.14µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.07      9.1±0.37µs        ? ?/sec    1.00      8.5±0.44µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.00      2.5±0.01µs        ? ?/sec    1.24      3.1±0.01µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.07     91.5±0.23µs        ? ?/sec    1.00     85.6±0.11µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.17      3.1±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.3±0.01µs        ? ?/sec    1.24      2.8±0.01µs        ? ?/sec
filter run array (kept 1/2)                                                   1.09    390.2±1.50µs        ? ?/sec    1.00    357.9±0.57µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.17    361.1±1.94µs        ? ?/sec    1.00    309.7±0.96µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.00    246.8±0.86µs        ? ?/sec    1.00    246.7±1.21µs        ? ?/sec
filter single record batch                                                    1.08    100.5±0.23µs        ? ?/sec    1.00     92.7±0.20µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.17    107.5±0.27µs        ? ?/sec    1.00     91.9±0.09µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.08      4.1±0.02µs        ? ?/sec    1.00      3.8±0.01µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      2.4±0.01µs        ? ?/sec    1.24      3.0±0.01µs        ? ?/sec
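
For readers unfamiliar with this output format: the number in front of each timing appears to be that timing divided by the fastest timing in the same row, so the faster side shows 1.00. The sketch below reproduces that calculation under this assumption; `relative_to_fastest` is a hypothetical helper, not part of the benchmark script.

```rust
/// Divide every timing in a row by the smallest timing in that row.
/// Assumes `times` is non-empty and all values are positive.
fn relative_to_fastest(times: &[f64]) -> Vec<f64> {
    let fastest = times.iter().copied().fold(f64::INFINITY, f64::min);
    times.iter().map(|t| t / fastest).collect()
}
```

For the "filter context f32 (kept 1/2)" row, the inputs `[135.6, 69.6]` (in µs) give roughly 1.95 for the branch and exactly 1.0 for `main`, matching the table. Some rows may differ in the last digit because the printed timings are themselves rounded.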

@alamb
Contributor Author

alamb commented May 29, 2025

The benchmarks look better now -- thank you @Dandandan . I will continue to obsess over them after verifying this PR helps with DataFusion

Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate

3 participants