Intermediate result blocked approach to aggregation memory management #15591

Conversation
Hi @Rachelint I think I have an alternative proposal that seems relatively easy to implement.
Really appreciated. The design in this PR indeed still introduces quite a few code changes... I tried not to modify anything about
But I found this way would introduce too much extra cost... Maybe we place the
I have finished development (and testing) of all the needed common structs!
It is very close; I just need to add more tests!
```rust
fn generate_group_indices(len: usize) -> Vec<usize> {
    (0..len).collect()
}
```
Does using indices `0..len` mean this benchmark is testing the case where each input row has a new unique group?
If this is the case, then I think it would make sense that the benchmark shows lots of time allocating / zeroing memory: the bulk of the work will be copying each value into a new accumulator slot.
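To make the distinction concrete, here is a hypothetical pair of index generators (not code from this PR): `(0..len)` assigns every row its own group, so every row creates a new accumulator slot, while a modulo pattern would instead exercise the repeated-groups path.

```rust
// Hypothetical benchmark inputs, for illustration only (not from the PR).
// All-unique groups: every input row lands in a fresh accumulator slot,
// so the benchmark is dominated by allocating/zeroing new group state.
fn generate_unique_group_indices(len: usize) -> Vec<usize> {
    (0..len).collect()
}

// Repeated groups: only `num_groups` distinct slots exist, so most rows
// update existing state instead of creating it.
fn generate_repeated_group_indices(len: usize, num_groups: usize) -> Vec<usize> {
    (0..len).map(|i| i % num_groups).collect()
}

fn main() {
    let unique = generate_unique_group_indices(4);
    assert_eq!(unique, vec![0, 1, 2, 3]); // 4 rows, 4 distinct groups

    let repeated = generate_repeated_group_indices(6, 2);
    assert_eq!(repeated, vec![0, 1, 0, 1, 0, 1]); // 6 rows, 2 distinct groups
    println!("{:?} {:?}", unique, repeated);
}
```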
I have tried enlarging the block size (8 * batch, 16 * batch...), but it seems to make only a slight difference to performance. So after the experiment, I think
I think I nearly understand the reason for this; it is possibly caused by
Got it!
I think it may be possible in this situation?
Yes, in the current implementation, it only helps performance with
Yeah I think that was expected.
Yeah it is quite efficient, although problematic for large inputs
If I don't misunderstand, does it mean a strategy like this:
Agreed. It also leads to large memory usage, because we only release memory after all the batches are returned (we hold the really large single batch in memory, return only slices of it, and release the memory all at once after all slices are returned).
Yes - exactly!
I wonder what the plan is for this PR? From what I understand, it currently improves performance for aggregates with large numbers of groups, but (slightly) slows down aggregates with smaller numbers of groups. I think this is due to accessing group storage via two indirections (block index / offset within the block). It seems like the proposal is to have some sort of adaptive structure that uses one-part indexes for small numbers of groups and then switches to two-part indexes for larger numbers.
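The two-part indexing discussed here can be sketched roughly as follows. This is an illustrative model, not the PR's actual types: the block size and the `BlockedStorage` name are assumptions, but the shape matches the described trade-off, since old blocks are never reallocated or copied, at the cost of a shift/mask plus two lookups per access.

```rust
// Illustrative sketch of blocked group storage (not the PR's actual code).
// A flat group index is split into (block_id, offset): block_id in the high
// bits, offset in the low bits. When a block fills up, a new one is
// allocated; existing blocks are never resized or moved.

const BLOCK_BITS: u32 = 12; // assumed block size: 4096 slots per block
const BLOCK_SIZE: usize = 1 << BLOCK_BITS;
const BLOCK_MASK: usize = BLOCK_SIZE - 1;

struct BlockedStorage<T> {
    blocks: Vec<Vec<T>>,
}

impl<T> BlockedStorage<T> {
    fn new() -> Self {
        Self { blocks: Vec::new() }
    }

    /// Append one value and return its flat (two-part) group index.
    fn push(&mut self, value: T) -> usize {
        if self.blocks.last().map_or(true, |b| b.len() == BLOCK_SIZE) {
            // Allocate a fresh block instead of resizing: no copy of old data.
            self.blocks.push(Vec::with_capacity(BLOCK_SIZE));
        }
        let block_id = self.blocks.len() - 1;
        let block = self.blocks.last_mut().unwrap();
        block.push(value);
        (block_id << BLOCK_BITS) | (block.len() - 1)
    }

    fn get(&self, index: usize) -> &T {
        // The "two indirections": block lookup, then offset lookup.
        &self.blocks[index >> BLOCK_BITS][index & BLOCK_MASK]
    }
}

fn main() {
    let mut storage: BlockedStorage<u64> = BlockedStorage::new();
    let mut indices = Vec::new();
    for v in 0..10_000u64 {
        indices.push(storage.push(v));
    }
    // Values are retrievable through their two-part indices.
    assert_eq!(*storage.get(indices[0]), 0);
    assert_eq!(*storage.get(indices[9_999]), 9_999);
    // 10_000 slots at 4096 per block -> 3 blocks allocated, none resized.
    assert_eq!(storage.blocks.len(), 3);
    println!("blocks: {}", storage.blocks.len());
}
```

The flat-`Vec` path avoids the shift/mask and one lookup per access, which is consistent with the small slowdown observed for low-cardinality aggregates.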
I am experimenting with what we can do based on this PR to improve performance further, like [memory reuse] (#15591 (comment)). Actually #16135 is part of that attempt. My biggest concern is whether we can get a more obvious improvement to make the change worthwhile...
As I understood it, the way to get a bigger improvement would be to implement the chunked approach for more group storage / aggregates, so that more queries in our benchmarks (like ClickBench) could use the new code path. Though of course that would make this PR even bigger. We could also make a "POC" type PR with some more implementation to prove out the performance, and then break it into smaller pieces for review 🤔
Yes, I am trying to implement it for
But according to the flamegraph, I found that all such queries' bottleneck is actually
I think the benefit from this one currently is:
Do you think it is still worth continuing to push forward for the above benefits?
Without having a full understanding of this PR (I have just been following the conversation because the change is exciting), my 2¢ is: for us, memory management is currently one of the biggest thorns with DataFusion. It is quite hard at the moment to run with a fixed memory budget, given the mix of exceeding memory through under-accounting and refusing to run operators that can't spill / try to claim more memory than they actually use.
I agree with @adriangb; even if it doesn't provide performance improvements, it is still super valuable to push it forward.
Thanks @adriangb @Dandandan. I just started my new job this week and am a bit busy, but I will continue to push it forward this weekend. The new targets for this one may be
So the rest of the work, I think:
Thanks for all your help @Rachelint and congratulations on the new job!
thanks @Rachelint and congratulations!
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
It is unfortunate we never figured out how to get this over the line 😢 Thank you for the effort anyways @Rachelint
I am sorry... but actually I still want to continue pushing it forward... However, it has been too busy these recent few months...
Maybe if/when you are able to return to it with a fresh set of eyes after a break we'll make progress. No worries at all -- I totally understand.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Maybe someday we'll get back to it 😢
Which issue does this PR close?

Rationale for this change

As mentioned in #7065, we use a single `Vec` to manage the aggregation intermediate results, both in `GroupAccumulator` and in `GroupValues`. It is simple but not efficient enough for high-cardinality aggregation, because when the `Vec` is not large enough, we need to allocate a new `Vec` and copy all data over from the old one.

So this PR introduces a blocked approach to manage the aggregation intermediate results. In this approach we never resize the `Vec`; instead we split the data into blocks, and when the capacity is not enough, we just allocate a new block. Details can be seen in #7065.

What changes are included in this PR?

`PrimitiveGroupsAccumulator` and `GroupValuesPrimitive` as the example

Are these changes tested?

Tested by existing tests, plus new unit tests and new fuzz tests.

Are there any user-facing changes?

Two functions are added to the `GroupValues` and `GroupAccumulator` traits. But as you can see, there are default implementations for them, and users can choose to actually support the blocked approach when they want better performance for their `udaf`s.
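The opt-in pattern described above (new trait methods with backward-compatible defaults) can be sketched like this. The trait and method names here are illustrative assumptions for the sketch, not the exact names added to DataFusion's traits:

```rust
// Illustrative sketch of opt-in blocked support via trait default methods.
// `GroupsAccumulatorExt`, `supports_blocked_groups`, and `alter_block_size`
// are assumed names for illustration, not the PR's actual API.
trait GroupsAccumulatorExt {
    /// Default: blocked group management is not supported, so all existing
    /// implementations keep compiling and working unchanged.
    fn supports_blocked_groups(&self) -> bool {
        false
    }

    /// Default: ignore the requested block size (flat, single-Vec mode).
    fn alter_block_size(&mut self, _block_size: Option<usize>) -> Result<(), String> {
        Ok(())
    }
}

// An existing accumulator: relies entirely on the defaults.
struct LegacyAccumulator;
impl GroupsAccumulatorExt for LegacyAccumulator {}

// An accumulator that opts in to the blocked approach.
struct BlockedAccumulator {
    block_size: Option<usize>,
}
impl GroupsAccumulatorExt for BlockedAccumulator {
    fn supports_blocked_groups(&self) -> bool {
        true
    }
    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<(), String> {
        self.block_size = block_size;
        Ok(())
    }
}

fn main() {
    let legacy = LegacyAccumulator;
    assert!(!legacy.supports_blocked_groups()); // default behavior preserved

    let mut blocked = BlockedAccumulator { block_size: None };
    assert!(blocked.supports_blocked_groups());
    blocked.alter_block_size(Some(4096)).unwrap();
    assert_eq!(blocked.block_size, Some(4096));
    println!("ok");
}
```

The engine would query `supports_blocked_groups()` before choosing the blocked code path, so only accumulators that explicitly opt in pay (and benefit from) the new behavior.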