fix(sparksql): Default ignoreNulls to true for collect_set backward compatibility #16947

Open
yaooqinn wants to merge 1 commit into facebookincubator:main from yaooqinn:fix/collect-set-default-ignore-nulls

@yaooqinn
Contributor

Summary

Fixes a backward compatibility bug introduced in PR #16416.

The ignoreNulls_ field in SparkCollectSetAggregate was defaulting to false (RESPECT NULLS). When the 1-arg signature collect_set(T) is used, setConstantInputs() does not receive a boolean constant, so the default value is used — which must match Spark's default behavior of ignoring nulls (true).

Root cause

// Before (broken): includes nulls by default
bool ignoreNulls_{false};

// After (fixed): ignores nulls by default (Spark's default)
bool ignoreNulls_{true};
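
To make the mechanism concrete, here is a minimal standalone sketch of how the default member initializer ends up deciding the behavior for the 1-arg signature. `CollectSetSketch` and its `std::optional<bool>` parameter are hypothetical stand-ins for illustration only; they are not the actual `SparkCollectSetAggregate` class or the real `setConstantInputs()` signature.

```cpp
#include <optional>

// Hypothetical stand-in for the aggregate; not the actual Velox class.
class CollectSetSketch {
 public:
  // Models setConstantInputs(): the 1-arg form supplies no boolean constant
  // (std::nullopt here), the 2-arg form supplies an explicit value.
  void setConstantInputs(std::optional<bool> ignoreNullsArg) {
    if (ignoreNullsArg.has_value()) {
      ignoreNulls_ = *ignoreNullsArg;
    }
    // Otherwise the default member initializer below stays in effect.
  }

  bool ignoreNulls() const {
    return ignoreNulls_;
  }

 private:
  // Default chosen to match Spark's collect_set behavior: ignore nulls.
  bool ignoreNulls_{true};
};
```

In this sketch, calling `setConstantInputs(std::nullopt)` leaves `ignoreNulls()` returning `true`, while passing `false` explicitly still selects RESPECT NULLS.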

Impact

Without this fix, any downstream consumer (e.g., Gluten) using the native collect_set with the 1-arg signature would get null elements in the output array, causing NullPointerException during Spark's result projection.
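
The difference in output can be modeled with a small standalone sketch. `collectSetSketch` is a hypothetical helper over nullable ints, not Velox code. It assumes that under RESPECT NULLS a null counts as one distinct element, and it keeps first-seen order only for readability; the real aggregate may order elements differently.

```cpp
#include <optional>
#include <set>
#include <vector>

// Hypothetical model of collect_set over nullable ints; not Velox code.
// Returns the distinct non-null values in first-seen order. A null is
// emitted at most once, and only when ignoreNulls is false.
std::vector<std::optional<int>> collectSetSketch(
    const std::vector<std::optional<int>>& input,
    bool ignoreNulls) {
  std::set<int> seen;
  bool seenNull = false;
  std::vector<std::optional<int>> result;
  for (const auto& value : input) {
    if (!value.has_value()) {
      if (!ignoreNulls && !seenNull) {
        seenNull = true;
        result.push_back(std::nullopt);
      }
      continue;
    }
    // insert().second is true only the first time a value is seen.
    if (seen.insert(*value).second) {
      result.push_back(value);
    }
  }
  return result;
}
```

With `ignoreNulls` set to `false`, an input containing nulls yields an array with a null element, which is the case a consumer expecting Spark's default would not be prepared for.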

Testing

Verified in Gluten with VeloxAggregateFunctionsDefaultSuite — all 16 collect_set/collect_list tests pass after this fix.

Related: Gluten PR apache/gluten#11837

@netlify

netlify bot commented Mar 28, 2026

Deploy Preview for meta-velox canceled.

Name | Link
🔨 Latest commit | d07e804
🔍 Latest deploy log | https://app.netlify.com/projects/meta-velox/deploys/69ca78f27fec4f00096c1df9

@meta-cla meta-cla bot added the CLA Signed label Mar 28, 2026
@github-actions

github-actions bot commented Mar 28, 2026

Build Impact Analysis

Directly Changed Targets

Target | Changed Files
velox_functions_spark_aggregates | CollectSetAggregate.cpp

Selective Build Targets (building these covers all 5 affected)

cmake --build _build/release --target spark_aggregation_fuzzer_test velox_functions_spark_aggregates_test velox_spark_query_runner_test velox_sparksql_coverage

Total affected: 5/555 targets

All affected targets (5)
  • spark_aggregation_fuzzer_test
  • velox_functions_spark_aggregates
  • velox_functions_spark_aggregates_test
  • velox_spark_query_runner_test
  • velox_sparksql_coverage

Fast path • Graph from main@f7c243e24ac2705f4d69bc87cbcde0259ac6775b

Collaborator

@jinchengchenghh jinchengchenghh left a comment


Please don't include unrelated changes.

…ompatibility

The ignoreNulls_ field in SparkCollectSetAggregate was defaulting to false
(RESPECT NULLS), which breaks backward compatibility when the 1-arg signature
is used. In this case, setConstantInputs() does not receive a boolean constant,
so the default value is used — which must match Spark's default behavior of
ignoring nulls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yaooqinn yaooqinn force-pushed the fix/collect-set-default-ignore-nulls branch from 565e97c to d07e804 on March 30, 2026 13:21
@yaooqinn
Contributor Author

Rebased on latest main and removed the unrelated changes from the diff. The PR now contains only the one-file fix (CollectSetAggregate.cpp).

Collaborator

@rui-mo rui-mo left a comment


Can you please add a test to verify the default behavior? Thanks.
