
Support Avg distinct #15356


Closed
wants to merge 6 commits into from


qazxcdswe123
Contributor

@qazxcdswe123 qazxcdswe123 commented Mar 22, 2025

Which issue does this PR close?

Rationale for this change

Related: #15099

The pattern is mostly copied from native.rs

What changes are included in this PR?

  1. Implemented separate accumulator types for Float64, Decimal128, and Decimal256, which together provide avg distinct
  2. Added tests and documentation
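
To make the first item concrete, here is a minimal sketch of what a distinct-average accumulator for Float64 can look like. It uses only the standard library; `DistinctAvgF64` and its methods are hypothetical names for illustration, not the types added in this PR and not DataFusion's `Accumulator` trait. It assumes that floats are deduplicated by raw bit pattern, since `f64` implements neither `Eq` nor `Hash`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a Float64 distinct-average accumulator.
/// This is an illustration only, not the code from this PR.
#[derive(Debug, Default)]
pub struct DistinctAvgF64 {
    // f64 is neither Eq nor Hash, so the raw bit pattern is stored instead.
    seen: HashSet<u64>,
}

impl DistinctAvgF64 {
    /// Record one input value; `None` models SQL NULL and is skipped.
    pub fn update(&mut self, value: Option<f64>) {
        if let Some(v) = value {
            self.seen.insert(v.to_bits());
        }
    }

    /// Average of the distinct values, or `None` if no non-null value was seen.
    pub fn evaluate(&self) -> Option<f64> {
        if self.seen.is_empty() {
            return None;
        }
        let sum: f64 = self.seen.iter().map(|bits| f64::from_bits(*bits)).sum();
        Some(sum / self.seen.len() as f64)
    }
}
```

With this keying, two NaNs that share a bit pattern collapse into one entry, while `0.0` and `-0.0` count as different values. Whether that matches the semantics wanted for SQL is a separate decision.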

Are these changes tested?

The existing test (previously a failing query) now passes, and new tests are added.

Are there any user-facing changes?

No

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 22, 2025
query RT
select avg(distinct c), arrow_typeof(avg(distinct c)) from t_decimal;
----
180 Decimal128(14, 8)
Contributor Author

@qazxcdswe123 qazxcdswe123 Mar 22, 2025


I'm not sure whether the types Decimal128(14, 8) and Decimal256(54, 6) below are correct.

Update: the types are correct, but the way we compute the result is incorrect.

@qazxcdswe123 qazxcdswe123 marked this pull request as draft March 22, 2025 14:00
@github-actions github-actions bot added the core Core DataFusion crate label Mar 22, 2025
@github-actions github-actions bot added the proto Related to proto crate label Mar 22, 2025
@qazxcdswe123 qazxcdswe123 marked this pull request as ready for review March 22, 2025 15:12
@@ -60,6 +64,17 @@ make_udaf_expr_and_func!(
avg_udaf
);

pub fn avg_distinct(expr: Expr) -> Expr {
Contributor


AggregateExprBuilder::new().distinct() is enough; we don't need this.

Contributor Author


Indeed. I added this because there's a count_distinct:

pub fn count_distinct(expr: Expr) -> Expr {

Should we deprecate this count_distinct?

Contributor

@jayzhan211 jayzhan211 Mar 24, 2025


That API has existed for a long time, so we keep it. We don't need to deprecate it.

/// Specialized implementation of `AVG DISTINCT` for Float64 values, handling the
/// special case for NaN values and floating-point equality.
#[derive(Debug, Default)]
pub struct Float64DistinctAvgAccumulator {
Contributor


Can you check out DistinctSumAccumulator? It supports f64 and decimal together in a single accumulator, so I guess we can unify these types for average too.

Contributor Author


Yes, it totally works. I should have seen this earlier.
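
The unification suggested in this thread can be sketched with a small trait that maps each native value to a hashable key, so one generic container serves both floats and decimal natives. Everything here (`DistinctNative`, `DistinctValues`, `avg_f64`) is a hypothetical name chosen for illustration; it is not how `DistinctSumAccumulator` is actually written, and `i128` merely stands in for a Decimal128 native value.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical trait: converts a native value to a hashable key and back.
pub trait DistinctNative: Copy {
    type Key: Eq + Hash + Copy;
    fn to_key(self) -> Self::Key;
    fn from_key(key: Self::Key) -> Self;
}

// Floats are keyed by bit pattern because f64 is not Eq/Hash.
impl DistinctNative for f64 {
    type Key = u64;
    fn to_key(self) -> u64 {
        self.to_bits()
    }
    fn from_key(key: u64) -> f64 {
        f64::from_bits(key)
    }
}

// An i128 (standing in for a decimal's unscaled value) is its own key.
impl DistinctNative for i128 {
    type Key = i128;
    fn to_key(self) -> i128 {
        self
    }
    fn from_key(key: i128) -> i128 {
        key
    }
}

/// One generic set of distinct values shared by all supported native types.
pub struct DistinctValues<T: DistinctNative> {
    seen: HashSet<T::Key>,
}

impl<T: DistinctNative> DistinctValues<T> {
    pub fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Record one value; `None` models SQL NULL and is skipped.
    pub fn update(&mut self, value: Option<T>) {
        if let Some(v) = value {
            self.seen.insert(v.to_key());
        }
    }

    /// Number of distinct non-null values seen so far.
    pub fn count(&self) -> usize {
        self.seen.len()
    }

    /// Iterate over the distinct values in unspecified order.
    pub fn values(&self) -> impl Iterator<Item = T> + '_ {
        self.seen.iter().map(|k| T::from_key(*k))
    }
}

/// Float average built on top of the shared distinct set.
pub fn avg_f64(acc: &DistinctValues<f64>) -> Option<f64> {
    let n = acc.count();
    if n == 0 {
        return None;
    }
    Some(acc.values().sum::<f64>() / n as f64)
}
```

Only the final division differs per type, which is why a single distinct-value container plus a type-specific finishing step is enough.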

pub struct DecimalDistinctAvgAccumulator<T: DecimalType + Debug> {
    values: HashSet<Hashable<T::Native>, RandomState>,
    sum_scale: i8,
    target_precision: u8,
Contributor


In Sum, we don't specify precision and scale, so we might not need them for avg.
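
For readers wondering where scale enters at all: a sum of decimals keeps the input scale, whereas an average with a wider result scale has to rescale before dividing. The sketch below shows that arithmetic on unscaled `i128` values. It is an assumption-laden illustration, not the PR's code: `avg_distinct_decimal` and `target_scale` are hypothetical names (only `sum_scale` appears in the draft struct), and precision checks are left out entirely.

```rust
use std::collections::HashSet;

/// Hypothetical helper, not DataFusion code: average of the distinct unscaled
/// decimal values, returned as an unscaled integer at `target_scale`.
///
/// `values` holds unscaled integers at `sum_scale` (e.g. 100.0000 at scale 4
/// is 1_000_000). Returns `None` for empty input; this sketch also returns
/// `None` on overflow, which real code would report as an error instead.
pub fn avg_distinct_decimal(
    values: &[Option<i128>],
    sum_scale: u32,
    target_scale: u32,
) -> Option<i128> {
    assert!(target_scale >= sum_scale, "sketch only widens the scale");

    // Drop NULLs and duplicates.
    let distinct: HashSet<i128> = values.iter().flatten().copied().collect();
    if distinct.is_empty() {
        return None;
    }

    let sum: i128 = distinct.iter().sum();

    // Rescale the sum to the target scale before dividing so that the
    // extra fractional digits are not lost to integer division.
    let factor = 10_i128.checked_pow(target_scale - sum_scale)?;
    let scaled = sum.checked_mul(factor)?;

    // Integer division truncates toward zero; rounding is out of scope here.
    Some(scaled / distinct.len() as i128)
}
```

For example, distinct inputs 100.0000, 200.0000, and 240.0000 at scale 4 give an unscaled result of 18_000_000_000 at scale 8, that is 180.00000000. If the result scale can be derived from the output data type at evaluation time, the accumulator itself does not need to store it, which is in line with the reviewer's point.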

@qazxcdswe123
Contributor Author

Closing this, as I split the PR into two.

Labels
core Core DataFusion crate functions Changes to functions implementation proto Related to proto crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

avg(distinct) support
2 participants