Improve performance of high cardinality grouping by reusing hash values #11680

Open
@alamb

Is your feature request related to a problem or challenge?

As described in #11679, we can do better for high cardinality aggregates.

One thing that consumes significant time in such queries is hashing, and I think we can reduce that significantly.

Specifically, for the multi-phase repartition plan, the number of hashed rows is roughly:

(input cardinality)  + 2 * (intermediate group cardinality) * (number of partitions)

For low cardinality aggregates (e.g. when the intermediate group cardinality is 1000), the second term is small (a few thousand extra hashes isn't a big deal).

However, for high cardinality aggregates (e.g. when the intermediate group cardinality is 1,000,000 and there are 16 partitions), the second term is substantial.
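
To make the two cases concrete, here is a small sketch that plugs numbers into the estimate above. The input cardinality of 100,000,000 rows and the function name `hashed_rows` are assumptions chosen for illustration, not measurements from any query.

```rust
/// Rough estimate of rows hashed by the multi-phase repartition plan:
/// (input cardinality) + 2 * (intermediate group cardinality) * (number of partitions)
fn hashed_rows(input_rows: u64, intermediate_groups: u64, partitions: u64) -> u64 {
    input_rows + 2 * intermediate_groups * partitions
}

fn main() {
    // Hypothetical input size, chosen only for illustration.
    let input_rows: u64 = 100_000_000;
    let partitions: u64 = 16;

    for groups in [1_000u64, 1_000_000] {
        let total = hashed_rows(input_rows, groups, partitions);
        // The "second term": hashes redone in the repartition and final phases.
        let extra = total - input_rows;
        println!("groups = {groups}: extra hashes = {extra}, total = {total}");
    }
}
```

Under these assumed numbers, with 16 partitions the second term is 32,000 for 1,000 groups and 32,000,000 for 1,000,000 groups.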

In pictures, this looks like:

               ▲                          ▲
               │                          │
               │                          │
               │                          │
               │                          │
               │                          │
   ┌───────────────────────┐  ┌───────────────────────┐       4. The  AggregateMode::Final
   │GroupBy                │  │GroupBy                │       GroupBy computes hash(group keys)
   │(AggregateMode::Final) │  │(AggregateMode::Final) │       *AGAIN* to find the correct hash
   │                       │  │                       │       bucket
   └───────────────────────┘  └───────────────────────┘
               ▲                          ▲
               │                          │
               └─────────────┬────────────┘
                             │
                             │
                             │
                ┌─────────────────────────┐                   3. The output of the first phase
                │       Repartition       │                   is repartitioned by computing
                │         HASH(x)         │                   hash(group keys) -- this is the
                └─────────────────────────┘                   same hash as computed in step 2.
                             ▲
                             │
             ┌───────────────┴─────────────┐
             │                             │
             │                             │
┌─────────────────────────┐  ┌──────────────────────────┐     2. Each AggregateMode::Partial
│         GroupBy         │  │         GroupBy          │     GroupBy hashes the group keys to
│(AggregateMode::Partial) │  │ (AggregateMode::Partial) │     find the correct hash bucket.
└─────────────────────────┘  └──────────────────────────┘
             ▲                             ▲
             │                            ┌┘
             │                            │
        .─────────.                  .─────────.
     ,─'           '─.            ,─'           '─.
    ;      Input      :          ;      Input      :          1. Input is read
    :   Partition 0   ;          :   Partition 1   ;
     ╲               ╱            ╲               ╱
      '─.         ,─'              '─.         ,─'
         `───────'                    `───────'

This effect can be seen in profiling for ClickBench Q17:

SELECT "UserID", "SearchPhrase", COUNT(*) FROM "hits.parquet" GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10
$ datafusion-cli -c 'SELECT "UserID", "SearchPhrase", COUNT(*) FROM "hits.parquet" GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;'

Here is the profiling from Instruments:
[Image: Screenshot 2024-07-26 at 4 26 14 PM]

Describe the solution you'd like

The basic idea is to avoid recomputing the hash values in RepartitionExec and AggregateMode::Final by reusing the values from AggregateMode::Partial (which has already computed a hash value for each input group).

Something like this:

                         ▲                          ▲                                                       
                         │                          │                                                       
                         │                          │                                                       
                         │                          │                                                       
                         │                          │                                                       
                         │                          │                                                       
             ┌───────────────────────┐  ┌───────────────────────┐       4. The  AggregateMode::Final        
             │GroupBy                │  │GroupBy                │       GroupBy also gets the hash values   
             │(AggregateMode::Final) │  │(AggregateMode::Final) │       and does not recompute them         
             │                       │  │                       │                                           
             └───────────────────────┘  └───────────────────────┘                                           
               ▲         ▲                          ▲                                                       
               │         │                          │                                                       
                         └─────────────┬────────────┘                                                       
Pass hash      │                       │                                                                    
values up the                          │                                                                    
plan rather    │                       │                                                                    
than                      ┌─────────────────────────┐                   3. In addition to the partial       
recomputing    │          │       Repartition       │                   aggregates and group values, *ALSO* 
them                      │    PRECOMPUTED_HASH     │                   pass the hash values to the         
               │          └─────────────────────────┘                   RepartitionExec which also passes   
                                       ▲                                them on to the AggregateMode::Final 
               │                       │                                                                    
                       ┌───────────────┴─────────────┐                                                      
               │       │                             │                                                      
                       │                             │                                                      
          ┌─────────────────────────┐  ┌──────────────────────────┐     2. Each AggregateMode::Partial      
          │         GroupBy         │  │         GroupBy          │     GroupBy hashes the group keys to    
          │(AggregateMode::Partial) │  │ (AggregateMode::Partial) │     find the correct hash bucket.       
          └─────────────────────────┘  └──────────────────────────┘                                         
                       ▲                             ▲                                                      
                       │                            ┌┘                                                      
                       │                            │                                                       
                  .─────────.                  .─────────.                                                  
               ,─'           '─.            ,─'           '─.                                               
              ;      Input      :          ;      Input      :          1. Input is read                    
              :   Partition 0   ;          :   Partition 1   ;                                              
               ╲               ╱            ╲               ╱                                               
                '─.         ,─'              '─.         ,─'                                                
                   `───────'                    `───────'                                                   
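
The flow in the diagram could be sketched roughly as follows, using only the Rust standard library rather than DataFusion's actual operators. Everything here is hypothetical and for illustration only: `Prehashed`, `PassThroughHasher`, `partial_aggregate`, `repartition`, `final_aggregate` and `run_plan` are made-up names, the group key is a single string, and the aggregate is a simple count. The point is only that the hash is computed once in the partial phase and then carried along with each group.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

/// A group key carried together with the hash computed in the partial phase.
#[derive(Clone, Debug)]
struct Prehashed {
    hash: u64,
    key: String,
}

impl PartialEq for Prehashed {
    fn eq(&self, other: &Self) -> bool {
        self.key == other.key
    }
}
impl Eq for Prehashed {}

impl Hash for Prehashed {
    // Feed the stored hash to the hasher instead of re-hashing the key.
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash);
    }
}

/// Hasher that returns the u64 it was given, so the stored hash is reused as-is.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        // Fallback for other key types; not used by `Prehashed`.
        for b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(*b);
        }
    }
    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
}

type PrehashedMap<V> = HashMap<Prehashed, V, BuildHasherDefault<PassThroughHasher>>;

/// The only place in this sketch where a group key is actually hashed.
fn hash_key(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Step 2: hash each input row once and emit (key + hash, partial count).
fn partial_aggregate(rows: &[&str]) -> Vec<(Prehashed, u64)> {
    let mut groups: PrehashedMap<u64> = HashMap::default();
    for row in rows {
        let key = Prehashed { hash: hash_key(row), key: row.to_string() };
        *groups.entry(key).or_insert(0) += 1;
    }
    groups.into_iter().collect()
}

/// Step 3: route each group using the carried hash; nothing is re-hashed.
fn repartition(batches: Vec<Vec<(Prehashed, u64)>>, n: usize) -> Vec<Vec<(Prehashed, u64)>> {
    let mut out = vec![Vec::new(); n];
    for (key, count) in batches.into_iter().flatten() {
        let idx = (key.hash % n as u64) as usize;
        out[idx].push((key, count));
    }
    out
}

/// Step 4: merge partial counts, again probing with the carried hash.
fn final_aggregate(input: Vec<(Prehashed, u64)>) -> PrehashedMap<u64> {
    let mut groups: PrehashedMap<u64> = HashMap::default();
    for (key, count) in input {
        *groups.entry(key).or_insert(0) += count;
    }
    groups
}

/// Runs all phases and returns the final (key, count) pairs sorted by key.
fn run_plan(inputs: &[Vec<&str>], n: usize) -> Vec<(String, u64)> {
    let partials: Vec<_> = inputs.iter().map(|rows| partial_aggregate(rows)).collect();
    let mut result: Vec<(String, u64)> = repartition(partials, n)
        .into_iter()
        .flat_map(final_aggregate)
        .map(|(key, count)| (key.key, count))
        .collect();
    result.sort();
    result
}

fn main() {
    let inputs = vec![vec!["a", "b", "a"], vec!["b", "c"]];
    println!("{:?}", run_plan(&inputs, 2));
}
```

In this sketch `hash_key` is called once per input row in `partial_aggregate`; `repartition` and `final_aggregate` only read the stored `hash` field.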

Describe alternatives you've considered

We could perhaps pass the hash values as an explicit new column somehow, or maybe as a field in a struct array 🤔
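
As a rough illustration of the "explicit new column" idea, the sketch below models a batch as plain vectors rather than Arrow arrays. The `PartialBatch` type, its field names, and `partition_indices` are all hypothetical and are not DataFusion or Arrow APIs.

```rust
/// Hypothetical columnar batch emitted by the partial aggregate.
struct PartialBatch {
    group_keys: Vec<String>,
    partial_counts: Vec<u64>,
    /// Extra column: one precomputed hash per row, carried with the data.
    group_hashes: Vec<u64>,
}

/// A downstream operator can pick the output partition for each row
/// straight from the hash column, without touching the group keys.
fn partition_indices(batch: &PartialBatch, num_partitions: usize) -> Vec<usize> {
    batch
        .group_hashes
        .iter()
        .map(|hash| (hash % num_partitions as u64) as usize)
        .collect()
}

fn main() {
    // Hash values here are made up for the example.
    let batch = PartialBatch {
        group_keys: vec!["a".into(), "b".into(), "c".into(), "d".into()],
        partial_counts: vec![2, 2, 1, 5],
        group_hashes: vec![0, 5, 10, 7],
    };
    let indices = partition_indices(&batch, 4);
    for ((key, count), idx) in batch.group_keys.iter().zip(&batch.partial_counts).zip(&indices) {
        println!("{key} (count {count}) -> partition {idx}");
    }
}
```

One trade-off of an explicit column is that every operator between the partial and final aggregates would need to preserve it.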

Additional context

No response

Labels

enhancement
