fix: UnnestExec preserves relevant equivalence properties of input #16985

vegarsti · 2025-07-30T17:47:52Z

Which issue does this PR close?

Closes "unnest should preserve the input's equivalence properties for uninvolved columns" (#15231).

What changes are included in this PR?

In UnnestExec's compute_properties we now construct its EquivalenceProperties using what we can from the input plan, so that we preserve sort ordering of unrelated columns (and avoid unnecessary sorts further up in the plan).

Are these changes tested?

Adds test cases to the sqllogictests for UnnestExec in unnest.slt

Are there any user-facing changes?

No

Explanation

Given a struct or array value col, unnest(col) takes the N elements of col and "spreads" these onto N rows, where all other columns in the statement are preserved. Said another way, when we unnest a column we are inserting a lateral cross-join against its elements, which by construction:

Duplicates every other column once for each array/map element
Replaces the original collection column with one (or more) “element” columns
Expands one input row into zero (if empty) or many output rows

E.g. (from unnest.slt):

datafusion/datafusion/sqllogictest/test_files/unnest.slt

Lines 699 to 712 in 6d9b76e

    
           query III 
        
           select unnest(column1) c1, unnest(column2) c2, column3 c3 from unnest_table group by c1, c2, c3 order by c1, c2, c3; 
        
           ---- 
        
           1 7 1 
        
           2 NULL 1 
        
           3 NULL 1 
        
           4 8 2 
        
           5 9 2 
        
           6 11 3 
        
           12 NULL NULL 
        
           NULL 10 2 
        
           NULL 12 3 
        
           NULL 42 NULL 
        
           NULL NULL NULL
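
To make the row expansion concrete, here is a minimal sketch in plain Rust. It is a toy model, not DataFusion code: `Row`, `unnest_items`, `is_non_decreasing`, `is_unique`, and `sample_input` are hypothetical names introduced only for illustration. It assumes an input that is sorted and unique on `id`, and unnests the `items` list.

```rust
use std::collections::HashSet;

/// Toy input row (hypothetical): `id` is sorted and unique in the input,
/// `items` is the list column being unnested.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i32,
    items: Vec<i32>,
}

/// Toy list unnest: emits one `(id, item)` pair per list element.
/// A row with an empty list produces no output rows.
fn unnest_items(rows: &[Row]) -> Vec<(i32, i32)> {
    rows.iter()
        .flat_map(|row| row.items.iter().map(move |item| (row.id, *item)))
        .collect()
}

/// True if every value is >= the one before it.
fn is_non_decreasing(values: &[i32]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// True if no value occurs twice.
fn is_unique(values: &[i32]) -> bool {
    let mut seen = HashSet::new();
    values.iter().all(|v| seen.insert(*v))
}

/// Sample input, sorted and unique on `id`.
fn sample_input() -> Vec<Row> {
    vec![
        Row { id: 1, items: vec![30, 10] },
        Row { id: 2, items: vec![] },
        Row { id: 3, items: vec![20] },
    ]
}
```

Calling `unnest_items(&sample_input())` yields the pairs `(1, 30)`, `(1, 10)`, `(3, 20)`. The `id` values are still non-decreasing, so an ordering on that column survives, but they are no longer unique, which is the reason a uniqueness or primary-key constraint cannot be carried over.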

The EquivalenceProperties struct has three types of properties:

equivalence groups (expressions with the same value)
ordering equivalence classes (expressions that define the same ordering)
table constraints - a set of columns that form a primary key or a unique key

In this PR we construct the UnnestExec node's EquivalenceProperties by using the input plan's equivalence properties for the columns that are not transformed - except for table constraints, which we discard entirely. The reasoning for discarding constraints is that because we're duplicating the other columns across rows, we are invalidating any uniqueness or primary-key constraint. We also need to do some twiddling with the projection mapping, since column indices change due to the unnesting.
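
The index shift can be illustrated with a small standalone sketch. This is not the code in unnest.rs: `ColumnKind` and `passthrough_mapping` are hypothetical, and the sketch assumes that a list column is replaced in place by one element column while a struct column is replaced in place by one column per field.

```rust
/// How a (hypothetical) unnest treats one input column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnKind {
    /// Copied through unchanged.
    Passthrough,
    /// A list column: replaced by a single element column.
    List,
    /// A struct column with this many fields: replaced by one column per field.
    Struct(usize),
}

/// For every input column, returns `Some(output_index)` if the column passes
/// through unchanged, or `None` if it is unnested (its properties are dropped).
fn passthrough_mapping(columns: &[ColumnKind]) -> Vec<Option<usize>> {
    // Position in the output schema of the next column to be emitted.
    let mut next_output = 0;
    columns
        .iter()
        .map(|kind| {
            let (mapped, width) = match kind {
                ColumnKind::Passthrough => (Some(next_output), 1),
                ColumnKind::List => (None, 1),
                ColumnKind::Struct(n) => (None, *n),
            };
            next_output += width;
            mapped
        })
        .collect()
}
```

For an input of `[Passthrough, Struct(2), Passthrough, List, Passthrough]` this gives `[Some(0), None, Some(3), None, Some(5)]`: the third input column moves from index 2 to index 3 because the struct before it widened into two columns. Orderings that mention only the mapped columns can be rewritten through such a mapping; anything that mentions an unnested column is dropped.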

datafusion/physical-expr/src/equivalence/ordering.rs

datafusion/physical-plan/src/unnest.rs

datafusion/sqllogictest/test_files/unnest.slt

vegarsti · 2025-07-30T20:26:37Z

Tagging @alamb, maybe you can trigger CI? 🙏🏻

vegarsti · 2025-07-31T05:50:28Z

datafusion/physical-plan/src/unnest.rs

-/// For list unnesting, each rows is vertically transformed into multiple rows
-/// For struct unnesting, each columns is horizontally transformed into multiple columns,
+/// For list unnesting, each row is vertically transformed into multiple rows
+/// For struct unnesting, each column is horizontally transformed into multiple columns,


Grammar fix

asubiotto

Nice!

datafusion/sqllogictest/test_files/unnest.slt

datafusion/physical-plan/src/unnest.rs

vegarsti · 2025-08-03T07:43:39Z

I discovered EquivalenceProperties.project, which seems to do what we need: We can get the unnest plan's equivalence properties by doing input_eq_properties.project(unnested_columns, schema), i.e. discarding the properties of the columns that are being unnested.

I have updated the PR doing that.

Equivalence properties are

equivalence groups (expressions with the same value)
ordering equivalence classes (expressions that define the same ordering)
table constraints - these can be primary key or unique

I am pretty sure that this takes care of 1 and 2, since we now have no equivalence properties for the columns. I am not yet sure about 3, though - if the original expression uses a column that is a primary key, after the unnest we will have multiple rows with the same column. Does that mean we need to remove that constraint from the eq properties? It kinda sounds like yes, but I need to see exactly what it's being used for.

vegarsti · 2025-08-03T17:46:59Z

After reading some more I have now updated it so that we remove any constraint from the properties. I've updated the PR description.

I think this is semantically sound now.

FYI @alamb and @asubiotto

asubiotto

Nice work! This LGTM, I'll leave it to @alamb for a final review and CI kickoff.

datafusion/physical-plan/src/unnest.rs

alamb · 2025-08-09T10:06:21Z

Sorry to bug you again @alamb, do you have time to review this today or tomorrow? 👀 Do let me know if there's anything that I can do to make it easier to review.

Hi @vegarsti - sorry I didn't see this earlier. I will try and review it over the next day or two

Maybe @berkaysynnada or @suremarc has some time to review as well

vegarsti · 2025-08-21T07:43:14Z

Friendly ping @alamb 😄

alamb

Thank you for this contribution @vegarsti

I am sorry for the delayed review -- I am always trying to encourage others to review PRs, but indeed I often function as the reviewer of last resort. Anything you can do to help (like help review PRs yourself) would be most appreciated!

This is definitely the right direction, but when I did some testing of this PR some of the behavior didn't make sense to me

Could you look at the test I provided, as well as add additional cases:

A case that unnests a struct (the one I provided unnests a list)
A case that has multiple lists/structs unnested (as the code seems to handle such a case)

datafusion/sqllogictest/test_files/unnest.slt

vegarsti · 2025-08-22T15:12:16Z

Thank you so much for the detailed and gracious review. Thanks for catching the weird behavior, I will address this.

And I am happy to start reviewing PRs!

alamb · 2025-09-04T10:57:27Z

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

vegarsti · 2025-09-04T11:32:01Z

Indeed, thanks!

vegarsti · 2025-09-13T19:19:41Z

Figured out why the test @alamb added failed -- the way I was creating the projection mapping was too simplistic, causing indexes to not match. Will add the two requested test cases as well.

vegarsti · 2025-09-13T19:43:59Z

Added two similar test cases:

with struct
with nested array, array, and struct

# cargo test --test sqllogictests -- unnest
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.19s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-8ad6b462cb0c808e)
Completed 1 test files in 0 seconds

vegarsti · 2025-09-26T10:55:14Z

Since CI ran on this one, I'll leave it here without updating the branch until this gets reviewed again 👍🏻

vegarsti · 2025-10-02T08:07:55Z

@berkaysynnada @suremarc @alamb Gentle ping for a review!

tobixdev

To me, the changes and tests make sense. Thanks!

CAVEAT: I am by no means a DataFusion pro. Just trying to learn more while providing some feedback. :)

tobixdev · 2025-10-13T16:09:55Z

datafusion/physical-plan/src/unnest.rs

+            .iter()
+            .enumerate()
+            .filter(|(idx, _)| {
+                !list_column_indices.contains(idx) && !struct_column_indices.contains(idx)


I think we have had multiple issues with quadratic planning time for a large amount of columns. I think we could get the same problem here as the contains is another linear scan, thus creating a quadratic runtime depending on the number of columns. I could also be wrong and this doesn't cause an issue.

Maybe we could build a buffer and then simply index into the buffer on whether this is unnested (not tested):

    let input_schema = input.schema();
    let mut unnested_indices = BooleanBufferBuilder::new(input.len());
    unnested_indices.append_n(input.len(), false);
    for list_unnest in list_column_indices {
        unnested_indices.set_bit(list_unnest.index_in_input_schema, true);
    }
    for list_unnest in struct_column_indices {
        unnested_indices.set_bit(*list_unnest, true)
    }
    let unnested_indices = unnested_indices.finish();
    let non_unnested_indices: Vec<usize> = (0..input_schema.fields().len())
        .filter(|idx| !unnested_indices.value(*idx))
        .collect();

Otherwise, I think changing the iterator to (0..input_schema.fields().len()) would help with readability as you don't seem to be using the actual field.

Otherwise, I think changing the iterator to (0..input_schema.fields().len()) would help with readability as you don't seem to be using the actual field.

Definitely doing this! Thank you.

Good idea to build a buffer and index into it. I'll give that a shot and see how it turns out!

Yeah, I'd only change that if it's easy to do with similar complexity. I think the quadratic behavior only causes a problem if we have many, many columns and most of them use unnest.

This worked very well! Added in c42c8c1. I think the buffer approach you gave is more readable as well, so it's win win! Thanks a lot
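
For readers without the diff at hand, the idea in this thread can be sketched without Arrow. The version below is a simplified stand-in, not the code from c42c8c1: it uses a plain `Vec<bool>` where the suggestion uses `BooleanBufferBuilder`, takes both index lists as `&[usize]` (the real list entries carry an `index_in_input_schema` field), and the function name `non_unnested_indices` is made up for the example.

```rust
/// Returns the indices in `0..num_columns` that are not unnested.
///
/// Marks each unnested index once in a mask and then scans the mask, so the
/// cost is O(num_columns + number of unnested columns) rather than one
/// linear `contains` scan per column.
///
/// Panics if any index is >= `num_columns`.
fn non_unnested_indices(
    num_columns: usize,
    list_column_indices: &[usize],
    struct_column_indices: &[usize],
) -> Vec<usize> {
    // One flag per input column; `true` means the column is unnested.
    let mut unnested = vec![false; num_columns];
    for &idx in list_column_indices.iter().chain(struct_column_indices) {
        unnested[idx] = true;
    }
    (0..num_columns).filter(|&idx| !unnested[idx]).collect()
}
```

With five input columns, a list at index 1, and a struct at index 3, `non_unnested_indices(5, &[1], &[3])` returns `[0, 2, 4]`.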

datafusion/physical-plan/src/unnest.rs

tobixdev · 2025-10-13T16:40:27Z

Maybe one additional note: I think the resulting sort properties can be improved for unnesting structs if we know that the struct columns themselves are ordered.

If that makes sense we could also somehow expand the LexSort entry for the struct column.

But as this is already an improvement I think that tracking this in a separate issue is fine.

vegarsti · 2025-10-13T17:04:05Z

Thank you so much @tobixdev!

vegarsti · 2025-10-13T17:07:44Z

Great idea!

adriangb · 2025-10-14T14:14:06Z

I took a look and it seems all good to me but given there's already been a lot of review on it I think the existing reviewers need to approve for it to be mergeable, so I will defer to them. Consider this my token ✅

alamb · 2025-10-14T20:55:45Z

Looks like there are some outstanding comments from @tobixdev -- please ping me @vegarsti when you have addressed them and are ready for a final review / stamp

vegarsti · 2025-10-15T04:25:32Z

Thanks a lot everyone! @alamb ready for the stamp now ;)

alamb

Thank you for this contribution @vegarsti

I think this is very close. I think it should have:

Some additional tests / comments cleanup (see comments)
Avoid unwrap / expect to minimize the severity of symptoms

alamb · 2025-10-16T13:56:00Z

datafusion/sqllogictest/test_files/unnest.slt

+physical_plan
+01)ProjectionExec: expr=[array_agg(unnested.ar)@1 as array_agg(unnested.ar)]
+02)--AggregateExec: mode=FinalPartitioned, gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)], ordering_mode=Sorted
+03)----SortExec: expr=[generated_id@0 ASC NULLS LAST], preserve_partitioning=[true]


this plan shows the data being sorted, but the comment suggests it should not be 🤔

Could you please explain in more detail what you expect this explain plan to be showing? Given there is no ORDER BY in the query (or in the OVER clause) it is not clear why this is testing ordering

datafusion/physical-plan/src/unnest.rs

alamb · 2025-10-16T15:49:05Z

datafusion/sqllogictest/test_files/unnest.slt

+3 400
+1 400
+
+# Explain should not have a SortExec


Could you also please add two additional tests:

A negative test case here: order by the output of the unnest and verify that it is in fact sorted correctly

A case with the ordering column as the first index (e.g. tuples like (100, [3,2,1], 'a')) and then order by 100

github-actions bot added the physical-expr (Changes to the physical-expr crates), sqllogictest (SQL Logic Tests (.slt)), and physical-plan (Changes to the physical-plan crate) labels Jul 30, 2025

vegarsti mentioned this pull request Jul 30, 2025

unnest should preserve the input's equivalence properties for uninvolved columns #15231

Open

vegarsti force-pushed the unnest-equivalence branch from 7527f16 to 5838f45 Compare July 30, 2025 17:49

vegarsti changed the title ~~Preserve the equivalence properties of the input plan in unnest~~ fix: Preserve equivalence properties of the input plan in unnest Jul 30, 2025

vegarsti changed the title ~~fix: Preserve equivalence properties of the input plan in unnest~~ fix: Preserve equivalence properties of input plan in unnest Jul 30, 2025

vegarsti force-pushed the unnest-equivalence branch from 5838f45 to 751a8ba Compare July 31, 2025 05:49

asubiotto reviewed Jul 31, 2025

vegarsti force-pushed the unnest-equivalence branch 4 times, most recently from 523eefd to 80567ec Compare August 3, 2025 07:23

github-actions bot removed the physical-expr (Changes to the physical-expr crates) label Aug 3, 2025

vegarsti changed the title ~~fix: Preserve equivalence properties of input plan in unnest~~ fix: UnnestExec preserves possible equivalence properties of input plan Aug 3, 2025

vegarsti changed the title ~~fix: UnnestExec preserves possible equivalence properties of input plan~~ fix: UnnestExec preserves possible equivalence properties of inpu Aug 3, 2025

vegarsti changed the title ~~fix: UnnestExec preserves possible equivalence properties of inpu~~ fix: UnnestExec preserves possible equivalence properties of input Aug 3, 2025

vegarsti force-pushed the unnest-equivalence branch from 80567ec to a17ec47 Compare August 3, 2025 14:17

vegarsti changed the title ~~fix: UnnestExec preserves possible equivalence properties of input~~ fix: UnnestExec preserves relevant equivalence properties of input Aug 3, 2025

asubiotto approved these changes Aug 4, 2025

vegarsti force-pushed the unnest-equivalence branch 2 times, most recently from 79ec7e6 to f1e889d Compare August 4, 2025 08:58

alamb reviewed Aug 22, 2025

alamb marked this pull request as draft September 4, 2025 10:57

vegarsti force-pushed the unnest-equivalence branch from 5ab9778 to c5ebd82 Compare September 13, 2025 19:17

vegarsti force-pushed the unnest-equivalence branch from c5ebd82 to 7099d11 Compare September 13, 2025 19:43

vegarsti marked this pull request as ready for review September 13, 2025 19:44

vegarsti force-pushed the unnest-equivalence branch 4 times, most recently from 0cf176b to 95cdb26 Compare September 19, 2025 10:14

vegarsti force-pushed the unnest-equivalence branch from c23e4a9 to 4093afb Compare September 26, 2025 09:01

tobixdev approved these changes Oct 13, 2025

vegarsti added 2 commits October 15, 2025 06:24

Preserve input's equivalence properties in UnnestExec

acecca1

Use a buffer for the indices to avoid quadratic contains

c42c8c1

vegarsti force-pushed the unnest-equivalence branch from 4093afb to c42c8c1 Compare October 15, 2025 04:24

alamb reviewed Oct 16, 2025
