Skip to content

Conversation

pepijnve
Copy link
Contributor

@pepijnve pepijnve commented Oct 3, 2025

Which issue does this PR close?

Improvement in the context of #18075

Rationale for this change

Speculative performance improvements for case evaluation

What changes are included in this PR?

Short circuit case evaluation loop when as soon as a value has been calculated for each input rows

Are these changes tested?

(Hopefully) covered by SQL logic tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Oct 3, 2025
@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 3, 2025

@alamb could you trigger a benchmark run on this PR (once CI gives the green light)?

@alamb
Copy link
Contributor

alamb commented Oct 3, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing case_improvements (3b75814) to 10a437b diff
BENCH_NAME=case_when
BENCH_COMMAND=cargo bench --bench case_when
BENCH_FILTER=
BENCH_BRANCH_NAME=case_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Oct 3, 2025

🤖: Benchmark completed

Details

group                          case_improvements                      main
-----                          -----------------                      ----
case_when: CASE expr           1.00     23.7±0.23µs        ? ?/sec    1.02     24.2±0.24µs        ? ?/sec
case_when: column or null      1.00   1401.1±2.72ns        ? ?/sec    1.01   1409.5±6.15ns        ? ?/sec
case_when: expr or expr        1.00     30.7±0.21µs        ? ?/sec    1.03     31.6±0.22µs        ? ?/sec
case_when: scalar or scalar    1.00      8.0±0.01µs        ? ?/sec    1.02      8.1±0.02µs        ? ?/sec

@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 3, 2025

Small improvement (which is what I had expected). WDYT, worth integrating?
Could just be noise too though. The 'expr or expr' and 'scalar or scalar' code paths were not modified.

Edit: I had a closer look at the case_when benchmark. It's probably too much of a microbenchmark to see any difference there. It does show that the extra conditionals don't tank performance.

The example I'm looking at locally is a projection where a classification category is being associated with each row using a large case expression. Something like
SELECT *, CASE WHEN predicate_1 THEN 0 WHEN predicate_2 THEN 1 WHEN predicate_2 THEN 2 ... END FROM table. If my reading of the code is correct, there's much more data being manipulated in a scenario like that.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it is worth pursing to me -- thanks @pepijnve

I wonder if we can factor out the pattern of "don't call evaluate_selection if the selection is still the entire remainder (mostly so we can document the behavior / make sure this it doesn't get lost in some future refactoring)

Maybe something we could wrap the count in some structure or a function

🤔

let mut remainder = Remainder::new(batch.num_rows());
remainder.update(&self.when_then_expr[i], &batch);

let mut current_value = new_null_array(&return_type, batch.num_rows());
// We only consider non-null values while comparing with whens
let mut remainder = not(&base_nulls)?;
let mut remainder_count = remainder.true_count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have found in past evaluations, that the code generated for true_count is astonishingly fast (it uses some special hardware instruction) so I am not surprised this works well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know that that’s fairly cheap.

What I’m experimenting with is retaining the filtered record batch from one loop iteration to the next so that the amount of data to be churned through each iteration shrinks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave the more complex optimisation work for another PR. The short circuit logic is already useful on it's own.

@alamb
Copy link
Contributor

alamb commented Oct 3, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing case_improvements (3b75814) to 10a437b diff
BENCH_NAME=case_when
BENCH_COMMAND=cargo bench --bench case_when
BENCH_FILTER=
BENCH_BRANCH_NAME=case_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Oct 3, 2025

🤖: Benchmark completed

Details

group                          case_improvements                      main
-----                          -----------------                      ----
case_when: CASE expr           1.00     24.0±0.10µs        ? ?/sec    1.00     23.9±0.14µs        ? ?/sec
case_when: column or null      1.00   1411.3±1.41ns        ? ?/sec    1.00   1411.1±5.67ns        ? ?/sec
case_when: expr or expr        1.01     31.2±0.25µs        ? ?/sec    1.00     31.0±0.28µs        ? ?/sec
case_when: scalar or scalar    1.00      8.0±0.02µs        ? ?/sec    1.00      8.0±0.02µs        ? ?/sec

@pepijnve pepijnve changed the title Case evaluation improvements Short circuit complex case evaluation modes as soon as possible Oct 15, 2025
@pepijnve pepijnve marked this pull request as ready for review October 15, 2025 14:46
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 15, 2025
@pepijnve
Copy link
Contributor Author

The expr_is_restrict_null_predicate caught a bug I had introduced that was not caught by the SLTs. I've added extra SLTs for case to plug the coverage hole.

@alamb
Copy link
Contributor

alamb commented Oct 15, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing case_improvements (da76c30) to 057583d diff
BENCH_NAME=case_when
BENCH_COMMAND=cargo bench --bench case_when
BENCH_FILTER=
BENCH_BRANCH_NAME=case_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Oct 15, 2025

🤖: Benchmark completed

Details

group                          case_improvements                      main
-----                          -----------------                      ----
case_when: CASE expr           1.01     23.3±0.10µs        ? ?/sec    1.00     23.1±0.18µs        ? ?/sec
case_when: column or null      1.00   1430.0±2.39ns        ? ?/sec    1.00  1434.4±12.61ns        ? ?/sec
case_when: expr or expr        1.00     31.1±0.11µs        ? ?/sec    1.00     31.1±0.12µs        ? ?/sec
case_when: scalar or scalar    1.01      8.3±0.07µs        ? ?/sec    1.00      8.2±0.08µs        ? ?/sec

@pepijnve
Copy link
Contributor Author

I’ll have a look at the existing micro benchmarks tomorrow. Not sure if there’s anything in there already with sufficient branches that you would notice the impact.

@pepijnve
Copy link
Contributor Author

I had already done the work of adding extra benchmarks in another branch. I've moved that over to #18097 which is now just an extension of the micro benchmarks. That'll make it easier to compare results.

@alamb
Copy link
Contributor

alamb commented Oct 17, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing case_improvements (558ad59) to ec3d20b diff
BENCH_NAME=case_when
BENCH_COMMAND=cargo bench --bench case_when
BENCH_FILTER=
BENCH_BRANCH_NAME=case_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Oct 17, 2025

🤖: Benchmark completed

Details

group                                                                                                             case_improvements                      main
-----                                                                                                             -----------------                      ----
case_when 8192x100: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE n + 1 END    1.01       2.6±0.08s        ? ?/sec    1.00       2.6±0.05s        ? ?/sec
case_when 8192x100: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END                                                         1.00     55.4±0.16µs        ? ?/sec    1.00     55.2±0.08µs        ? ?/sec
case_when 8192x100: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END                                                       1.01   478.5±13.92µs        ? ?/sec    1.00    472.1±6.25µs        ? ?/sec
case_when 8192x100: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END                                                   1.00      6.7±0.01µs        ? ?/sec    1.00      6.7±0.01µs        ? ?/sec
case_when 8192x100: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 END           1.00       3.1±0.02s        ? ?/sec    1.00       3.1±0.01s        ? ?/sec
case_when 8192x100: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END                                                     1.00    458.6±7.40µs        ? ?/sec    1.05   481.7±15.12µs        ? ?/sec
case_when 8192x3: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE n + 1 END      1.01    489.8±3.82ms        ? ?/sec    1.00    484.6±4.84ms        ? ?/sec
case_when 8192x3: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END                                                           1.00     55.3±0.24µs        ? ?/sec    1.00     55.3±0.28µs        ? ?/sec
case_when 8192x3: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END                                                         1.03    104.6±1.02µs        ? ?/sec    1.00    102.0±0.39µs        ? ?/sec
case_when 8192x3: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END                                                     1.00      6.6±0.01µs        ? ?/sec    1.00      6.6±0.01µs        ? ?/sec
case_when 8192x3: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 END             1.00    558.9±1.86ms        ? ?/sec    1.00    557.5±2.35ms        ? ?/sec
case_when 8192x3: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END                                                       1.01    115.2±0.45µs        ? ?/sec    1.00    114.2±0.48µs        ? ?/sec
case_when 8192x50: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE n + 1 END     1.00  1434.2±33.04ms        ? ?/sec    1.00  1433.9±28.82ms        ? ?/sec
case_when 8192x50: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END                                                          1.00     55.3±0.12µs        ? ?/sec    1.00     55.3±0.10µs        ? ?/sec
case_when 8192x50: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END                                                        1.02    263.4±3.35µs        ? ?/sec    1.00    259.1±4.86µs        ? ?/sec
case_when 8192x50: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END                                                    1.01      6.7±0.02µs        ? ?/sec    1.00      6.6±0.01µs        ? ?/sec
case_when 8192x50: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 END            1.00   1782.6±7.70ms        ? ?/sec    1.00   1787.7±8.00ms        ? ?/sec
case_when 8192x50: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END                                                      1.03    275.1±5.26µs        ? ?/sec    1.00    267.7±2.42µs        ? ?/sec

@pepijnve
Copy link
Contributor Author

Going to take a closer look at this. In one of the benches I would expect a more significant difference.

@pepijnve
Copy link
Contributor Author

🤦‍♂️ I botched the benchmark. Operator::Eq in the predicates instead of Operator::LtEq...

@pepijnve
Copy link
Contributor Author

So sorry about that @alamb. #18144 makes the benchmark actually do what it's supposed to do.

@alamb
Copy link
Contributor

alamb commented Oct 18, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing case_improvements (3f0eaa3) to 93f136c diff
BENCH_NAME=case_when
BENCH_COMMAND=cargo bench --bench case_when
BENCH_FILTER=
BENCH_BRANCH_NAME=case_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Oct 18, 2025

🤖: Benchmark completed

Details

group                                                                                                             case_improvements                      main
-----                                                                                                             -----------------                      ----
case_when 8192x100: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE n + 1 END    1.00      3.6±0.01ms        ? ?/sec    5.77     20.5±0.34ms        ? ?/sec
case_when 8192x100: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END                                                         1.00     55.7±0.09µs        ? ?/sec    1.02     56.6±0.42µs        ? ?/sec
case_when 8192x100: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END                                                       1.02    376.7±9.66µs        ? ?/sec    1.00    370.3±2.07µs        ? ?/sec
case_when 8192x100: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END                                                   1.00      6.7±0.02µs        ? ?/sec    1.01      6.8±0.02µs        ? ?/sec
case_when 8192x100: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 END           1.00       2.8±0.01s        ? ?/sec    1.00       2.8±0.02s        ? ?/sec
case_when 8192x100: CASE c1 WHEN 0 THEN 0 WHEN 1 THEN 1 ... WHEN n THEN n ELSE n + 1 END                          1.00       2.8±0.01s        ? ?/sec    1.00       2.8±0.01s        ? ?/sec
case_when 8192x100: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END                                                     1.00    370.7±8.05µs        ? ?/sec    1.01    374.7±8.37µs        ? ?/sec
case_when 8192x100: CASE c2 WHEN 0 THEN 0 WHEN 1000 THEN 1 ... WHEN n * 1000 THEN n ELSE n + 1 END                1.00      3.5±0.02ms        ? ?/sec    20.04    71.1±0.40ms        ? ?/sec
case_when 8192x3: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE n + 1 END      1.00    419.7±3.13µs        ? ?/sec    36.65    15.4±0.12ms        ? ?/sec
case_when 8192x3: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END                                                           1.00     55.2±0.15µs        ? ?/sec    1.02     56.3±0.11µs        ? ?/sec
case_when 8192x3: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END                                                         1.02     23.0±0.36µs        ? ?/sec    1.00     22.4±0.33µs        ? ?/sec
case_when 8192x3: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END                                                     1.01      6.8±0.02µs        ? ?/sec    1.00      6.7±0.01µs        ? ?/sec
case_when 8192x3: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 END             1.02    259.9±0.68ms        ? ?/sec    1.00    255.9±0.83ms        ? ?/sec
case_when 8192x3: CASE c1 WHEN 0 THEN 0 WHEN 1 THEN 1 ... WHEN n THEN n ELSE n + 1 END                            1.01    238.0±0.55ms        ? ?/sec    1.00    234.8±0.82ms        ? ?/sec
case_when 8192x3: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END                                                       1.05     32.2±0.50µs        ? ?/sec    1.00     30.5±0.28µs        ? ?/sec
case_when 8192x3: CASE c2 WHEN 0 THEN 0 WHEN 1000 THEN 1 ... WHEN n * 1000 THEN n ELSE n + 1 END                  1.00    385.9±3.18µs        ? ?/sec    169.08    65.2±0.26ms        ? ?/sec
case_when 8192x50: CASE WHEN c1 < 0 THEN 0 WHEN c1 < 1000 THEN 1 ... WHEN c1 < n * 1000 THEN n ELSE n + 1 END     1.00  1885.7±17.69µs        ? ?/sec    9.95     18.8±0.31ms        ? ?/sec
case_when 8192x50: CASE WHEN c1 <= 500 THEN 1 ELSE 0 END                                                          1.00     55.6±0.14µs        ? ?/sec    1.01     56.4±0.10µs        ? ?/sec
case_when 8192x50: CASE WHEN c1 <= 500 THEN c2 ELSE c3 END                                                        1.04    171.8±5.61µs        ? ?/sec    1.00    165.8±2.92µs        ? ?/sec
case_when 8192x50: CASE WHEN c1 <= 500 THEN c2 [ELSE NULL] END                                                    1.00      6.8±0.01µs        ? ?/sec    1.00      6.8±0.02µs        ? ?/sec
case_when 8192x50: CASE WHEN c1 == 0 THEN 0 WHEN c1 == 1 THEN 1 ... WHEN c1 == n THEN n ELSE n + 1 END            1.00   1459.3±7.66ms        ? ?/sec    1.00   1453.8±6.30ms        ? ?/sec
case_when 8192x50: CASE c1 WHEN 0 THEN 0 WHEN 1 THEN 1 ... WHEN n THEN n ELSE n + 1 END                           1.00   1432.8±7.46ms        ? ?/sec    1.00   1428.1±6.66ms        ? ?/sec
case_when 8192x50: CASE c1 WHEN 1 THEN c2 WHEN 2 THEN c3 END                                                      1.07    185.5±4.28µs        ? ?/sec    1.00    173.4±1.21µs        ? ?/sec
case_when 8192x50: CASE c2 WHEN 0 THEN 0 WHEN 1000 THEN 1 ... WHEN n * 1000 THEN n ELSE n + 1 END                 1.00  1861.4±10.71µs        ? ?/sec    37.20    69.2±0.30ms        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Oct 18, 2025

Screenshot 2025-10-18 at 9 26 08 AM

@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 18, 2025

Those are the gains I was expecting to see! 🚀 Just to keep our feet solidly on the ground, these are somewhat contrived and extreme examples. But I'll take it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the diligence @pepijnve

// start with nulls as default output
let mut current_value = new_null_array(&return_type, batch.num_rows());
let mut remainder = BooleanArray::from(vec![true; batch.num_rows()]);
let mut remainder_count = batch.num_rows();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the similarity of this code and the one above, I wonder if there is some way to avoid the duplication (as part of a follow on PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #18152 the code is changing a bit further. If the approach there pans out I want to try to do the same for case_when_with_expr. It'll be easier to see if there's an extractable pattern once that work settles down, so if it's ok with you I would like to postpone your suggestion for a little bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, of course -- we can always make the code better as follow on PRs

@alamb alamb added the performance Make DataFusion faster label Oct 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Make DataFusion faster physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants