fix: use input precision for avg decimal return type #1027
Draft
kadinrabo wants to merge 1 commit into substrait-io:main from
Summary
Change the return type of `avg` for `DECIMAL<P,S>` from `DECIMAL<38,S>` to `DECIMAL<P,S>` in `functions_arithmetic_decimal.yaml`.

The current spec widens the output precision of `AVG(DECIMAL<P,S>)` to the maximum (38), matching what `SUM` does. However, unlike SUM, AVG is mathematically bounded by the range of the input values: the result of an average can never exceed the min/max of the input set. Precision widening is unnecessary for AVG and causes type incompatibilities in downstream query optimizers.

Problem
Consumers that use Substrait as an interchange format between a SQL frontend and an optimizer (e.g., Apache Calcite) encounter type validation errors because the optimizer's internal type inference for AVG returns `DECIMAL<P,S>` (same as input, per the SQL standard), while the Substrait plan declares `DECIMAL<38,S>`.

Specifically, Calcite's `AggregateReduceFunctionsRule` rewrites `AVG(x) -> SUM(x) / COUNT(x)`. During this rewrite, it re-derives the intermediate types from the input column type. When the input is `DECIMAL<10,2>` but the AVG's declared output is `DECIMAL<38,2>`, the rule detects a type mismatch and throws an error.

This only affects DECIMAL: other numeric types (INT, FLOAT) don't have parameterized precision, so no mismatch occurs.
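The mismatch described above can be reproduced with a toy model. This is only a sketch: `DecimalType`, `declared_avg_type`, `inferred_avg_type`, and `validate` are hypothetical names invented for illustration, and the error text is made up. None of it is Calcite or Substrait code. It assumes, as stated above, that the optimizer infers the input type for AVG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    """Toy stand-in for a parameterized DECIMAL<P,S> type (hypothetical)."""
    precision: int
    scale: int

    def __str__(self) -> str:
        return f"DECIMAL<{self.precision},{self.scale}>"


def declared_avg_type(arg: DecimalType, widen: bool) -> DecimalType:
    """Return type the plan declares for avg.

    widen=True models the current spec (DECIMAL<38,S>);
    widen=False models the proposed spec (DECIMAL<P,S>).
    """
    return DecimalType(38, arg.scale) if widen else arg


def inferred_avg_type(arg: DecimalType) -> DecimalType:
    """Type an optimizer derives for AVG when it mirrors the input type."""
    return arg


def validate(arg: DecimalType, widen: bool) -> None:
    """Raise if the declared and inferred AVG types disagree."""
    declared = declared_avg_type(arg, widen)
    inferred = inferred_avg_type(arg)
    if declared != inferred:
        # Illustrative message only, not the optimizer's real error text.
        raise TypeError(f"type mismatch: declared {declared}, inferred {inferred}")


col = DecimalType(10, 2)
validate(col, widen=False)       # proposed spec: types agree, no error
try:
    validate(col, widen=True)    # current spec: DECIMAL<38,2> vs DECIMAL<10,2>
except TypeError as exc:
    print(exc)
```

With the proposed return type the declared and inferred types are identical, so a check of this shape has nothing to reject.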
Why AVG is different from SUM
SUM legitimately needs the widened precision. Summing many `DECIMAL(10,2)` values can overflow 10 digits of precision, and widening to 38 prevents this. The current spec is correct for SUM.

AVG does not have this problem. The average of any set of `DECIMAL(P,S)` values is bounded by `[min(input), max(input)]`, which by definition fits in `DECIMAL(P,S)`. The only precision concern is in the fractional part from the division, which is a scale issue, and the scale `S` is already preserved.

Note that the `intermediate` type correctly uses `DECIMAL<38,S>` for the SUM component of the decomposed AVG (SUM + COUNT), which is appropriate. Only the final return type should match the input.

What other systems do
The proposed change aligns Substrait with the SQL standard and the majority of query engines.
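The bound argued in "Why AVG is different from SUM" can be checked numerically. The sketch below uses Python's `decimal` module as a stand-in for decimal arithmetic; the helper `fits` and the 1000-row sample are assumptions made for illustration and do not reflect how any engine evaluates these aggregates.

```python
from decimal import Decimal


def fits(value: Decimal, precision: int, scale: int) -> bool:
    """True if value, rounded to `scale` fractional digits, satisfies
    |value| < 10**(precision - scale), i.e. fits in DECIMAL(precision, scale)."""
    quantum = Decimal(1).scaleb(-scale)            # e.g. 0.01 for scale 2
    rounded = value.quantize(quantum)
    return abs(rounded) < Decimal(1).scaleb(precision - scale)


P, S = 10, 2
largest = Decimal("99999999.99")                   # largest DECIMAL(10,2) value
values = [largest] * 1000                          # worst case: every row at the maximum

total = sum(values, Decimal(0))                    # the SUM component
average = total / len(values)                      # the final AVG result

print(fits(total, P, S))      # False: the running sum overflows DECIMAL(10,2)
print(fits(average, P, S))    # True: the average stays within DECIMAL(10,2)
print(fits(total, 38, S))     # True: the widened intermediate holds the sum
```

Even with every input at the maximum representable value, the sum needs the widened precision while the average does not.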
Change
      - name: "avg"
        description: Average a set of values.
        impls:
          - args:
              - name: x
                value: "DECIMAL<P,S>"
            options:
              overflow:
                values: [ SILENT, SATURATE, ERROR ]
            nullability: DECLARED_OUTPUT
            decomposable: MANY
            intermediate: "STRUCT<DECIMAL<38,S>,i64>"
    -       return: "DECIMAL<38,S>"
    +       return: "DECIMAL<P,S>"

Note: the `intermediate` type remains `STRUCT<DECIMAL<38,S>,i64>`, since the SUM component of the decomposed AVG still needs widened precision to prevent overflow during accumulation.

Impact
`resolveVariant` function in `expr/functions.go` will automatically resolve `AVG(DECIMAL<P,S>)` to `DECIMAL<P,S>` instead of `DECIMAL<38,S>` once the YAML is updated.
`DECIMAL<P,S>` range.
This change is