HIVE-28675: Maximize the removal of redundant columns from GROUP BY clauses #5586

zabetak · 2024-12-20T22:26:48Z

What changes were proposed in this pull request?

Enhance HiveRelFieldTrimmer to remove the maximum number of redundant columns from the GROUP BY clause.

Why are the changes needed?

Generate more efficient plans by pruning as many columns as possible (less CPU/IO/network cost).
Avoid missing optimization opportunities by examining all candidates.

For more see HIVE-28675.

Does this PR introduce any user-facing change?

More efficient query plans.

Is the change a dependency upgrade?

No

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver.java -Dqfile="cbo_groupby_remove_key.q"

soumyakanti3578 · 2025-01-02T20:21:11Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRelFieldTrimmer.java

-      if (aggregate.getGroupSet().contains(key)) {
-        groupByUniqueKey = key;
-        break;
+      ImmutableBitSet removableCols = originalGroupSet.except(key).except(fieldsUsed);


Is it possible to compute except(fieldsUsed) outside the loop? I believe changing the order of excepts will still yield the same resulting set.

Yes, we can do this, but we should be careful: if there is no unique key match, we should not remove any columns at all (the except in the return statement below should be skipped in that case).

I just realized that we only update columnsToRemove when there is at least one match, so yes, I think we can move the except(fieldsUsed) outside the for loop.
Also, would it be more efficient to loop over fieldsUsed (checking whether a field is a unique key and part of the aggregate keys, so that only fieldsUsed is retained) than to loop over uniqueKeys? That mostly depends on the size of fieldsUsed vs. uniqueKeys. I will leave it up to you to decide.

Yes, it is possible to compute except(fieldsUsed) outside the loop. I applied the suggestion in e89386d.

Comparing the efficiency of the iteration between fieldsUsed and uniqueKeys is not possible because the semantics are different. The uniqueKeys variable is a set of sets (Set<ImmutableBitSet>) while fieldsUsed is a single set (ImmutableBitSet). Note that a key is not necessarily a single column but a set of columns (composite key).
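
To make the two points above concrete, here is a minimal, self-contained sketch. It is not the code in HiveRelFieldTrimmer: it uses java.util.BitSet as a stand-in for Calcite's ImmutableBitSet, the helper names (except, contains, bits, maxRemovable) are hypothetical, and the unique keys chosen for the passenger table (id, passport, and the composite fname+lname) are assumptions made for illustration only. It shows that except(fieldsUsed) can be hoisted out of the loop because set difference commutes, and that each unique key is a set of columns that must be fully contained in the group set.

```java
import java.util.BitSet;
import java.util.List;

class GroupByKeyPruningSketch {

  // Set difference: a copy of 'a' with every bit of 'b' cleared.
  static BitSet except(BitSet a, BitSet b) {
    BitSet result = (BitSet) a.clone();
    result.andNot(b);
    return result;
  }

  // True when every bit of 'sub' is also set in 'sup'.
  static boolean contains(BitSet sup, BitSet sub) {
    return except(sub, sup).isEmpty();
  }

  // Convenience constructor for a bit set from column indexes.
  static BitSet bits(int... indexes) {
    BitSet b = new BitSet();
    for (int i : indexes) {
      b.set(i);
    }
    return b;
  }

  // Examines every unique key (each one a set of columns, possibly composite)
  // and returns the largest set of grouping columns that could be dropped.
  // Returns an empty set when no key is fully contained in the group set.
  static BitSet maxRemovable(BitSet groupSet, List<BitSet> uniqueKeys, BitSet fieldsUsed) {
    // Hoisted out of the loop: (G \ key) \ used == (G \ used) \ key.
    BitSet unusedGroupCols = except(groupSet, fieldsUsed);
    BitSet best = new BitSet();
    for (BitSet key : uniqueKeys) {
      if (!contains(groupSet, key)) {
        continue; // the key does not determine the groups, so skip it
      }
      BitSet candidate = except(unusedGroupCols, key);
      if (candidate.cardinality() > best.cardinality()) {
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Assumed column indexes: 0=id, 1=fname, 2=lname, 3=passport.
    BitSet groupSet = bits(0, 1, 2, 3);
    // Assumed unique keys: {id}, {passport}, {fname, lname}.
    List<BitSet> uniqueKeys = List.of(bits(0), bits(3), bits(1, 2));

    // SELECT passport, COUNT(1) ... GROUP BY id, fname, lname, passport
    System.out.println(maxRemovable(groupSet, uniqueKeys, bits(3)));
    // SELECT fname, lname, COUNT(1) ... GROUP BY id, fname, lname, passport
    System.out.println(maxRemovable(groupSet, uniqueKeys, bits(1, 2)));
  }
}
```

Under these assumptions, the first query could drop id, fname and lname (grouping by passport alone), and the second could drop id and passport (grouping by fname, lname). Stopping at the first matching key, as the removed lines of the diff did, would only find those results if the keys happened to be visited in a favourable order.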

Thank you for your explanation. Makes sense to me.

sonarqubecloud · 2025-01-03T14:00:54Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

soumyakanti3578

LGTM!

ramesh0201 · 2025-01-03T20:38:39Z

LGTM +1. I just left a minor comment for my understanding; please feel free to merge if the question is irrelevant. :)

ramesh0201 · 2025-01-03T20:40:43Z

ql/src/test/queries/clientpositive/cbo_groupby_remove_key.q

+EXPLAIN CBO SELECT passport, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport;
+EXPLAIN CBO SELECT fname, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport;
+EXPLAIN CBO SELECT lname, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport;
+EXPLAIN CBO SELECT fname, lname, COUNT(1) FROM passenger GROUP BY id, fname, lname, passport;


In this case, is grouping by fname, lname always a better plan, even if there is a different aggregate function?

HIVE-28675: Maximize the removal of redundant columns from GROUP BY clauses

0859de2

asf-ci-hive added tests pending tests passed and removed tests pending labels Dec 20, 2024

zabetak marked this pull request as ready for review December 31, 2024 16:09

soumyakanti3578 reviewed Jan 2, 2025

HIVE-28675: Compute unused grouping keys outside of the for-loop

e89386d

asf-ci-hive added tests pending and removed tests passed labels Jan 3, 2025

asf-ci-hive added tests passed and removed tests pending labels Jan 3, 2025

soumyakanti3578 approved these changes Jan 3, 2025

ramesh0201 reviewed Jan 3, 2025
