Description
Is your feature request related to a problem or challenge?
As described in #11679, we can do better for high cardinality aggregates.
One thing that consumes significant time in such queries is hashing, and I think we can reduce it significantly.
Specifically, for the multi-phase repartitioned plan, the number of hashed rows is roughly
(input cardinality) + 2 * (intermediate group cardinality) * (number of partitions)
For low cardinality aggregates (e.g. when the intermediate group cardinality is 1,000) the second term is small: a few thousand extra hashes isn't a big deal.
However, for high cardinality aggregates (e.g. when the intermediate group cardinality is around 1,000,000 and there are 16 partitions) the second term is substantial.
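To make the formula concrete, here is a small sketch that evaluates it for the two scenarios above (the function name and the input row count are illustrative, not DataFusion code):

```rust
/// Rough count of hash computations in the two-phase plan:
/// the partial GroupBy hashes every input row once, and the
/// Repartition + final GroupBy each hash every intermediate
/// group in every partition.
fn total_hashes(input_rows: u64, intermediate_groups: u64, partitions: u64) -> u64 {
    input_rows + 2 * intermediate_groups * partitions
}

fn main() {
    // Low cardinality: the second term adds only 32,000 hashes.
    let low = total_hashes(10_000_000, 1_000, 16);
    // High cardinality: the second term adds 32,000,000 hashes
    // on top of the 10,000,000 input-row hashes.
    let high = total_hashes(10_000_000, 1_000_000, 16);
    println!("low={low} high={high}");
}
```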
In pictures, this looks like
▲ ▲
│ │
│ │
│ │
│ │
│ │
┌───────────────────────┐ ┌───────────────────────┐ 4. The AggregateMode::Final
│GroupBy │ │GroupBy │ GroupBy computes hash(group keys)
│(AggregateMode::Final) │ │(AggregateMode::Final) │ *AGAIN* to find the correct hash
│ │ │ │ bucket
└───────────────────────┘ └───────────────────────┘
▲ ▲
│ │
└─────────────┬────────────┘
│
│
│
┌─────────────────────────┐ 3. The output of the first phase
│ Repartition │ is repartitioned by computing
│ HASH(x) │ hash(group keys) -- this is the
└─────────────────────────┘ same hash as computed in step 2.
▲
│
┌───────────────┴─────────────┐
│ │
│ │
┌─────────────────────────┐ ┌──────────────────────────┐ 2. Each AggregateMode::Partial
│ GroupBy                 │ │ GroupBy                  │ GroupBy hashes the group keys to
│(AggregateMode::Partial) │ │ (AggregateMode::Partial) │ find the correct hash bucket.
└─────────────────────────┘ └──────────────────────────┘
▲ ▲
│ ┌┘
│ │
.─────────. .─────────.
,─' '─. ,─' '─.
; Input : ; Input : 1. Input is read
: Partition 0 ; : Partition 1 ;
╲ ╱ ╲ ╱
'─. ,─' '─. ,─'
`───────' `───────'
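To spell out the redundancy in steps 2–4, here is a minimal sketch using std's `DefaultHasher` as a stand-in for DataFusion's hashing (`hash_key` is a hypothetical helper, not a DataFusion API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for hashing a row's group keys.
fn hash_key(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let group = "UserID=42,SearchPhrase=abc";
    let h_partial = hash_key(group);              // step 2: bucket in the partial GroupBy
    let target = (hash_key(group) % 16) as usize; // step 3: pick 1 of 16 output partitions
    let h_final = hash_key(group);                // step 4: bucket in the final GroupBy
    // All three calls recompute the identical value from the same group keys.
    assert_eq!(h_partial, h_final);
    println!("hash computed 3 times; routed to partition {target}");
}
```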
This effect can be seen in profiling for ClickBench Q17:
SELECT "UserID", "SearchPhrase", COUNT(*) FROM "hits.parquet" GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10
$ datafusion-cli -c 'SELECT "UserID", "SearchPhrase", COUNT(*) FROM "hits.parquet" GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;'
Here is the profiling from Instruments:
Describe the solution you'd like
The basic idea is to avoid recomputing the hash values in RepartitionExec and AggregateMode::Final by reusing the values from AggregateMode::Partial (which has already computed a hash value for each input group).
Something like this:
▲ ▲
│ │
│ │
│ │
│ │
│ │
┌───────────────────────┐ ┌───────────────────────┐ 4. The AggregateMode::Final
│GroupBy │ │GroupBy │ GroupBy also gets the hash values
│(AggregateMode::Final) │ │(AggregateMode::Final) │ and does not recompute them
│ │ │ │
└───────────────────────┘ └───────────────────────┘
▲ ▲ ▲
│ │ │
└─────────────┬────────────┘
Pass hash │ │
values up the │
plan rather │ │
than ┌─────────────────────────┐ 3. In addition to the partial
recomputing │ │ Repartition │ aggregates and group values, *ALSO*
them │ PRECOMPUTED_HASH │ pass the hash values to the
│ └─────────────────────────┘ RepartitionExec which also passes
▲ them on to the AggregateMode::Final
│ │
┌───────────────┴─────────────┐
│ │ │
│ │
┌─────────────────────────┐ ┌──────────────────────────┐ 2. Each AggregateMode::Partial
│ GroupBy                 │ │ GroupBy                  │ GroupBy hashes the group keys to
│(AggregateMode::Partial) │ │ (AggregateMode::Partial) │ find the correct hash bucket.
└─────────────────────────┘ └──────────────────────────┘
▲ ▲
│ ┌┘
│ │
.─────────. .─────────.
,─' '─. ,─' '─.
; Input : ; Input : 1. Input is read
: Partition 0 ; : Partition 1 ;
╲ ╱ ╲ ╱
'─. ,─' '─. ,─'
`───────' `───────'
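A sketch of the proposed flow, with hypothetical types and functions (not DataFusion APIs): the partial aggregate emits the hash it already computed alongside each group, so the repartitioner and the final aggregate reuse it instead of hashing the group keys again. A real implementation would still have to compare group keys on hash collisions; this sketch skips that for brevity.

```rust
use std::collections::HashMap;

// Hypothetical row emitted by AggregateMode::Partial.
struct PartialOutput {
    group_key: String,
    partial_count: u64,
    hash: u64, // precomputed once by the partial GroupBy
}

// Step 3: route each row using the precomputed hash -- no rehashing.
fn repartition(rows: Vec<PartialOutput>, n: u64) -> Vec<Vec<PartialOutput>> {
    let mut parts: Vec<Vec<PartialOutput>> = (0..n).map(|_| Vec::new()).collect();
    for row in rows {
        let p = (row.hash % n) as usize;
        parts[p].push(row);
    }
    parts
}

// Step 4: the final merge also indexes by the precomputed hash.
fn final_merge(partition: Vec<PartialOutput>) -> Vec<(String, u64)> {
    let mut acc: HashMap<u64, (String, u64)> = HashMap::new();
    for row in partition {
        let entry = acc.entry(row.hash).or_insert((row.group_key, 0));
        entry.1 += row.partial_count;
    }
    let mut out: Vec<_> = acc.into_values().collect();
    out.sort(); // deterministic output for the example
    out
}

fn main() {
    let rows = vec![
        PartialOutput { group_key: "a".into(), partial_count: 2, hash: 17 },
        PartialOutput { group_key: "a".into(), partial_count: 3, hash: 17 },
        PartialOutput { group_key: "b".into(), partial_count: 1, hash: 4 },
    ];
    // 16 output partitions; rows with equal hashes land together.
    for part in repartition(rows, 16) {
        for (key, count) in final_merge(part) {
            println!("{key}={count}");
        }
    }
}
```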
Describe alternatives you've considered
We could perhaps pass the hash values as an explicit new column somehow, or maybe as a field in a struct array 🤔
Additional context
No response