Skip to content

Conversation

pepijnve
Copy link
Contributor

@pepijnve pepijnve commented Sep 28, 2025

Which issue does this PR close?

Rationale for this change

#17357 introduced a change that replaces coalesce function calls with case expressions. In the current implementation these two differ in the way they report their nullability. coalesce is more precise than case all will report itself as not nullable in situations where the equivalent case does report being nullable.

The rest of the codebase currently does not expect the nullability property of an expression to change as a side effect of expression simplification. This PR is a first attempt to align the nullability of coalesce and case.

What changes are included in this PR?

Tweaks to the nullable logic for the logical and physical case expression code to report case as being not nullable in more situations.

Are these changes tested?

Additional unit tests have been added to test the new logic.

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates core Core DataFusion crate labels Sep 28, 2025
@pepijnve pepijnve marked this pull request as ready for review September 29, 2025 10:52
@pepijnve pepijnve force-pushed the issue_17801 branch 3 times, most recently from 4bbaa82 to 7f8d7cf Compare September 29, 2025 12:56
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @pepijnve --

I am not quite sure about this implementation (I am hoping #17628 might solve the problem too with more sophisticated case folding)

However, I verified it does solve the problem with running the benchmarks so from that perspective I think we should proceed

My only real concern is that the newly added tests cover only the new code, and not the "end to end" behavior you tracked down (namely that the case pattern with coalesce changes the nullability).

Would it be possible to add some of the cases as expr simplification tests too? Somewhere like here?

Comment on lines 924 to 925
when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo"))
.otherwise(lit(0))?,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: You can probably make this more concise using the eq method, something like this:

Suggested change
when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo"))
.otherwise(lit(0))?,
when(col("foo").eq(lit(5))), col("foo")).otherwise(lit(0))?,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise there is Expr::and for ands that could be used as well below

However, the current setup of using and as a prefix is pretty clear too, so maybe what you have here is actually more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I missed that. I was looking for prefix versions, and hadn't realised infix ones existed too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up sticking with prefix notation for the boolean combinators and infix for the rest. Using infix for the boolean made it hard to read. I've also added the SQL equivalent as a comment.

assert!(expr.nullable(&get_schema(false)).unwrap());
}

fn check_nullability(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this a little confusing at first, because it makes an explicit assumption that expr's will never introduce nulls (in order for !expr.nullable(&get_schema(false))?, to be true). So for example, it wouldn't do the right thing with the NULLIF function NULLIF(foo, 25) or something

Maybe some comments would help

Suggested change
fn check_nullability(
/// Verifies that `expr` has `nullable` nullability when the 'foo' column is
/// null.
/// Also assumes and verifies that `expr` is NOT nullable when 'foo' is NOT null
fn check_nullability(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reworked the logical plan test cases already to (hopefully) make it more obvious what's going on. I hadn't given this function much thought since it was only a test thing.

check_nullability(
when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo"))
.otherwise(lit(0))?,
true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically this could also be reported as false, given that if foo is null, then the expr resolves to 0 (non null)

> create table t(foo int) as values (0), (NULL), (5);
0 row(s) fetched.
Elapsed 0.001 seconds.

> select foo, CASE WHEN foo=5 THEN foo ELSE 0 END from t;
+------+---------------------------------------------------------+
| foo  | CASE WHEN t.foo = Int64(5) THEN t.foo ELSE Int64(0) END |
+------+---------------------------------------------------------+
| 0    | 0                                                       |
| NULL | 0                                                       |
| 5    | 5                                                       |
+------+---------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.002 seconds.

However, maybe we can improve that in a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the const evaluation is far from complete. I tried to do something good enough for the coalesce simplification initially.
I was wondering the whole time if there isn't some existing null analysis logic somewhere in the codebase we could reuse. The best I could come up with is rewriting the full expression by replacing the then expression with literal NULL and then attempting const evaluation. But that got me worrying about planning overhead again.

when(
or(
is_not_null(col("foo")),
binary_expr(col("foo"), Operator::Eq, lit(5)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, I don't think this expression can everr be true so this overall expression is still non nullable

col("foo"),
)
.otherwise(lit(0))?,
true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too -- this expression is not nullabile

> select foo, CASE WHEN foo=5 OR foo IS NOT NULL THEN foo ELSE 0 END from t;
+------+------------------------------------------------------------------------------+
| foo  | CASE WHEN t.foo = Int64(5) OR t.foo IS NOT NULL THEN t.foo ELSE Int64(0) END |
+------+------------------------------------------------------------------------------+
| 0    | 0                                                                            |
| NULL | 0                                                                            |
| 5    | 5                                                                            |
+------+------------------------------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.002 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, if you comment out the filter step (i.e. revert to the pre-patch version) all of these cases are reported as being nullable. The scope of this PR is to get at least some cases that are definitely not nullable reported as such, not ensure all cases are reported correctly.

.otherwise(lit(0))?,
true,
get_schema,
)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also please add a check with is_null in the OR clause (which should be null)

Something like the equivalent to

> select foo, CASE WHEN foo=5 OR foo IS NULL THEN foo ELSE 0 END from t;
+------+--------------------------------------------------------------------------+
| foo  | CASE WHEN t.foo = Int64(5) OR t.foo IS NULL THEN t.foo ELSE Int64(0) END |
+------+--------------------------------------------------------------------------+
| 0    | 0                                                                        |
| NULL | NULL                                                                     |
| 5    | 5                                                                        |
+------+--------------------------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.000 seconds.

Like

      check_nullability(
            when(
                or(
                    binary_expr(col("foo"), Operator::Eq, lit(5)),
                    is_null(col("foo")),
                ),
                col("foo"),
            )
            .otherwise(lit(0))?,
            true,
            get_schema,
        )?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this test case

@pepijnve
Copy link
Contributor Author

pepijnve commented Sep 29, 2025

I am not quite sure about this implementation (I am hoping #17628 might solve the problem too with more sophisticated case folding)

I warned you it wasn't very elegant. 😄 I don't think #17628 covers the same thing though. What we're trying to do here is get ExprSchemable::nullable to report an accurate value outside of the optimiser. Ideally you would want that to work both before and after case folding.
I agree that there is a lot of overlap in the required logic though. If you were to take the case expression, replace the then expression trees with literal NULL everywhere they occur, then perform case folding, and then perform null analysis then you would get the same result.

My only real concern is that the newly added tests cover only the new code, and not the "end to end" behavior you tracked down (namely that the case pattern with coalesce changes the nullability).

Would it be possible to add some of the cases as expr simplification tests too? Somewhere like here?

I'm not sure what kind of test you have in mind. The end to end case is (admittedly very indirectly) covered by TPC-DS query 75 and the removal of the double optimisation. If you revert the production code change in this PR, but keep the test change you'll see that it fails.

For the simplifier itself, I was wondering if there shouldn't be some internal assertions that verifies that the result of calling ScalarUDF::simplify doesn't change key aspects of the expression that the schema also encodes. Both the data type and the nullability should not change since that causes the expression and the schema to be mismatched.

@pepijnve
Copy link
Contributor Author

pepijnve commented Sep 30, 2025

@alamb thinking about this a bit more. I'm going to struggle expressing myself sufficiently clearly here, but I'll try to explain the idea behind what I'm doing. Maybe that can help us figure out a better way to express the idea.

What I'm trying to do is improve the accuracy of the predicate is_nullable(expr) specifically for CASE expressions.
In the current code this predicate is implemented by checking (in pseudo code) case_expr.when_then.any?((when, then) -> is_nullable(then)) || is_nullable(case_expr.else). This results in quite a few CASE expressions being reported as nullable even though they're not.

In particular there's one interesting case (pun not intended) which results from the coalesce simplification and that is CASE WHEN x IS NOT NULL THEN x ELSE y. If the expression x is nullable, is_nullable will currently report the entire case expression as nullable. The implementation does not take into account that there is a guard clause preventing the null value of x from being returned.

What I attempted to do in this PR is to look at the more general form WHEN f(x) THEN x where f(x) is some arbitrary predicate that may or may not depend on the value of x. What the code is trying to do is to const evaluate f(NULL). If it can with 100% certainty (i.e. Some is returned) and the evaluated value is false, then predicate f guarantees this particular branch of the case expression from ever returning a NULL and we can ignore the nullability of x.

I tried to implement this in a cheap, but imprecise way. My rationale was that even though it's not perfect, it's an improvement in accuracy over the current code.
A possible alternative would be to rewrite the expression corresponding to f using the binding x = null and then attempting to const evaluate using the existing const evaluation code. That introduces a dependency from the logical expression module to the optimiser though and would probably have a longer run time than the current crude approximation.

@pepijnve
Copy link
Contributor Author

I've massaged the logical plan version of the code a bit further already to hopefully clarify what it's doing. I then ran the test cases with logging output rather than assertions before and after the extra filtering to illustrate what's being changed. After the change all tests pass. Before the patch it reports the following

CASE WHEN x IS NOT NULL THEN x ELSE Int32(0) END nullable? should be false, but was true
CASE WHEN NOT x IS NULL THEN x ELSE Int32(0) END nullable? should be false, but was true
CASE WHEN x IS NOT NULL AND x = Int32(5) THEN x ELSE Int32(0) END nullable? should be false, but was true
CASE WHEN x = Int32(5) AND x IS NOT NULL THEN x ELSE Int32(0) END nullable? should be false, but was true
CASE WHEN x = Int32(5) AND x IS NOT NULL OR x = bar AND x IS NOT NULL THEN x ELSE Int32(0) END nullable? should be false, but was true

@pepijnve
Copy link
Contributor Author

@alamb I've taken the logical expression portion of the PR another step further which ensures correct answers for the expressions you mentioned earlier. I can complete the physical expression portion as well if you like. Unless you tell me this path is a dead end.

@alamb
Copy link
Contributor

alamb commented Sep 30, 2025

@alamb I've taken the logical expression portion of the PR another step further which ensures correct answers for the expressions you mentioned earlier. I can complete the physical expression portion as well if you like. Unless you tell me this path is a dead end.

Thank you -- I will try and get to this one asap. Somehow every time i think I am getting the queue of reviews under control there are like 50 new notifications ! It is a good problem to have.

@pepijnve
Copy link
Contributor Author

Thank you -- I will try and get to this one asap. Somehow every time i think I am getting the queue of reviews under control there are like 50 new notifications ! It is a good problem to have.

No pressure from my side. I just write up my notes and move on to the next thing. Async delayed response is fine.

@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 1, 2025

I experimented a bit with the rewrite + const eval approach on the physical expression side of things. While attractive and simple to implement, the downside is that it's going to be very hard to ensure the logical and physical side agree. Logical needs to work without ExecutionProps so it has less information available to it compared to the PhysicalExpr tree. I don't see a way to resolve that. As a consequence I ended up with an limited ad hoc version of const evaluation on the logical side and would have to do the same for physical which isn't really ideal from a DRY perspective.

@alamb
Copy link
Contributor

alamb commented Oct 6, 2025

Than you -- this is on my list of things to review shortly

@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 6, 2025

The main question I'm left with is an existential one. Does it make sense to have all this additional code just to try to get more accurate nullability reporting for case? If the answer is yes, then the followup question is what the best way to achieve that is. Some kind of static analysis like I'm trying here or some completely different approach altogether?

Rephrasing the questions after thinking about it a bit more:

Where do we think the bug that triggers #17801 is actually located? The root cause, at least based on my analysis, is that the ScalarUDF::simplify implementation for coalesce is returning an expression that has a different nullability value compared to the original one.
This is a schema change, and it seems this is not being handled entirely correctly causing the panic later on. The schema for the logical plan still has nullable = false, but the derived physical expression for the case returns nullable = true.
Should this be mismatch be resolved during logical plan rewriting or is this a violation of the simplify contract?

A second consideration is what the impact of inaccurate nullability reporting is. There are quite a few simplifications in the expression simplifier that take nullability into account for instance. Is there benefit to be had from being more accurate? In the same area as coalesce, nvl and nvl2 for instance currently always report nullable = true. The typical use of these functions is to ensure non-null values though.

nvl and nvl2 do not perform the lazy evaluation simplification that was added to coalesce. Would it make sense to try to align all of these? Basically redirect all these conditional expressions to case and then make that code path as efficient as possible.

@alamb
Copy link
Contributor

alamb commented Oct 8, 2025

This is a schema change, and it seems this is not being handled entirely correctly causing the panic later on. The schema for the logical plan still has nullable = false, but the derived physical expression for the case returns nullable = true.
Should this be mismatch be resolved during logical plan rewriting or is this a violation of the simplify contract?

I believe it is a violation of the (unwritten) simplify contract.

nvl and nvl2 do not perform the lazy evaluation simplification that was added to coalesce. Would it make sense to try to align all of these? Basically redirect all these conditional expressions to case and then make that code path as efficient as possible.

I think this makes a lot of sense to me

I am reviewing this PR more.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pepijnve -- I re-reviewed this PR carefully and I think it is well thought out, commented and tested.

thank you for fixing this issue (that was not of your making)

My recommended next steps are:

  1. Document that simplify should not be changing the schema (including nullability) -- I'll make a PR to do so now.
  2. File tickets to convert nvl and nvl2 to use the same lazy evaluation simplification that was added to coalesce. (I will do so)

fn const_eval_predicate(&self, predicate: &Expr) -> Option<TriStateBool> {
match predicate {
// Literal null is equivalent to boolean uncertain
Expr::Literal(scalar, _) => TriStateBool::try_from(scalar).ok(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always worry about ignoring errors as creating a DataFusionError is quite expensive (it allocates a String and we have seen it show up in traces before)

However, I changed this to the following locally and all the tests still pass

            Expr::Literal(scalar, _) => Some(TriStateBool::try_from(scalar).unwrap()),

So I think this would only discard errors if the Expr was not correctly type coerced.

Thus I think this is ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My rationale here was that, since nullability analysis must to be infallible, that it would be ok to return None here which means "answer is undefined". If you were to actually try to evaluate the expression it will most likely cause a runtime error at that point so inaccurate nullability is the least of your concerns.

Err(_) => None,
}
}
Expr::IsTrue(e) => match self.const_eval_predicate(e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR -- I can't help but feel IS TRUE, IS NOT TRUE, etc is pretty redundant and we could probably have had a simpler Expr representation for this. No action required, just a musing

}

#[test]
fn test_case_expression_nullability() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an impressive list of tests

@alamb
Copy link
Contributor

alamb commented Oct 8, 2025

@alamb
Copy link
Contributor

alamb commented Oct 8, 2025

The new CI failure seems related to a different PR. I am investigating

@alamb
Copy link
Contributor

alamb commented Oct 8, 2025

@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 9, 2025

Thank you @pepijnve -- I re-reviewed this PR carefully and I think it is well thought out, commented and tested.

Thanks. I was starting to question my sanity a bit 😄 I'll proceed with aligning the physical expression side of things with what I've done for logical then.

@alamb
Copy link
Contributor

alamb commented Oct 9, 2025

Thanks. I was starting to question my sanity a bit 😄 I'll proceed with aligning the physical expression side of things with what I've done for logical then.

Finding any way to make this easier to maintain / understand would be most appreciated (I think you have done a pretty good job so far, FWIW)

@pepijnve pepijnve marked this pull request as draft October 9, 2025 16:24
@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 9, 2025

Finding any way to make this easier to maintain / understand would be most appreciated

I'll get there eventually. Latest commit tightens the code up further.

@alamb
Copy link
Contributor

alamb commented Oct 15, 2025

Is this PR ready to merge / review? It is marked as a draft

@pepijnve
Copy link
Contributor Author

Is this PR ready to merge / review? It is marked as a draft

No not ready yet. I think the logical plan side (assuming we're conceptually ok with the partial duplication of const evaluation) is in pretty good shape now. Physical case needs to come to the exact same conclusion though and I haven't updated that implementation yet.

@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 15, 2025

I stumbled upon

pub fn is_restrict_null_predicate<'a>(
just now which seems to be trying to do something very similar.

Interestingly the implementer chose to do the one thing I've been trying to avoid the whole time which is

let execution_props = ExecutionProps::default();

I'm not sure if this gives you correct results in all cases. Expressions that use NOW() for instance may end up reporting the wrong value.

@alamb
Copy link
Contributor

alamb commented Oct 15, 2025

just now which seems to be trying to do something very similar.

Looks like it came in via #13081 from @JasonLi-cn and @Dandandan -- maybe they have some thoughts they can share on your approach

@pepijnve
Copy link
Contributor Author

pepijnve commented Oct 16, 2025

Poking @comphead as well since they reviewed that PR too.

let (state, plan) = df.into_parts();
let plan = state.optimize(&plan)?;
if create_physical {
let _ = state.create_physical_plan(&plan).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still dont' get this part. Previously physical plan was created on top of optimized logical, why this step is skipped? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_physical_plan calls optimize itself already. Due to the fact that the logical plan was being optimised twice the logical vs physical schema mismatch was being hidden.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To see what I'm talking about, comment out this line on main and run tpcds_physical_q75. You'll get

Error: Internal("Physical input schema should be the same as the one converted from logical input schema. Differences: \n\t- field nullability at index 5 [sales_cnt]: (physical) true vs (logical) false\n\t- field nullability at index 6 [sales_amt]: (physical) true vs (logical) false")

@pepijnve
Copy link
Contributor Author

@comphead I should probably have been a bit more specific. The question I have for you relates to #13081. In that PR, a change was added at

let execution_props = ExecutionProps::default();
where a logical expression is being evaluated using default ExecutionProps.

In this PR I have a similar need to const evaluate a logical expression without having ExecutionProps available. I was a bit concerned to do so using default values because you're missing some possibly essential information. I was wondering if you had any insight into this. Is the change in #13081 possibly wrong because of the default props? Or am I worrying about a non-issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression: sql_planner benchmark panic'ing on main

3 participants