Disproportionate memory use for `DISTINCT ON` query #17169

### Describe the bug

We have a [parquet file](https://digitalsociety-public.fsn1.your-objectstorage.com/f94d0c87-8798-4bf6-9c98-8d89971e2539.parquet) (built from [public data](https://statistics.gov.scot/data/domestic-energy-performance-certificates)) with 106 columns and 1M rows. It is 131.14 MiB compressed (913.89 MiB uncompressed).

When running the following `DISTINCT ON` query with the unbounded memory pool, memory use climbs to over 160 GiB:

```sql
SELECT DISTINCT
  ON ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE") *
FROM
  table
ORDER BY
  "ADDRESS1",
  "ADDRESS2",
  "ADDRESS3",
  "POSTCODE",
  "INSPECTION_DATE" DESC
```

With a fair spill pool limited to 10 GiB, memory usage reaches "only" 30 GiB, which is still three times the limit.
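
To put those figures in proportion, here is a rough back-of-the-envelope calculation using only the numbers reported above. The variable names are mine, and the peaks are approximate readings ("over 160 GiB", "30 GiB"), so the ratios are order-of-magnitude estimates rather than measurements.

```python
# Rough ratios derived from the figures reported in this issue.
MIB_PER_GIB = 1024

uncompressed_mib = 913.89                # uncompressed size of the parquet data
unbounded_peak_mib = 160 * MIB_PER_GIB   # "over 160 GiB" with the unbounded pool
fair_spill_limit_gib = 10                # configured fair spill pool limit
fair_spill_peak_gib = 30                 # observed peak with that limit

# How many times larger than the data the peak memory was.
amplification = unbounded_peak_mib / uncompressed_mib
# How far the process went past the configured pool limit.
overshoot = fair_spill_peak_gib / fair_spill_limit_gib

print(f"unbounded pool: ~{amplification:.0f}x the uncompressed data size")
print(f"fair spill pool: ~{overshoot:.0f}x the configured limit")
```

That works out to roughly 179x the uncompressed data size for the unbounded pool, and about 3x the configured limit for the fair spill pool.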

These results were observed on my local machine (MacBook Pro). On a production machine with the same 10 GiB limit we have seen a graceful allocation failure:

```
Resources exhausted: Failed to allocate additional 55.0 MB for GroupedHashAggregateStream[3]
```

This makes me think it could be the same underlying issue as https://github.com/apache/datafusion/issues/13831, exacerbated by the many columns.
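
As a way of reasoning about why the column count might matter, here is a toy Python model of the query's semantics. It assumes (unconfirmed) that `DISTINCT ON` is executed as a grouped aggregation that retains one value per selected column for every distinct key, which would be consistent with the `GroupedHashAggregateStream` in the error above. The helper name and the sample rows are made up for illustration; this is not DataFusion's implementation.

```python
from typing import Any

Row = dict[str, Any]

def distinct_on_latest(rows: list[Row], keys: list[str], order_col: str) -> dict[tuple, Row]:
    """Keep, per key, the row with the greatest `order_col` (hypothetical helper)."""
    state: dict[tuple, Row] = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        kept = state.get(key)
        # Mirrors ORDER BY ... "INSPECTION_DATE" DESC: the newest row wins.
        if kept is None or row[order_col] > kept[order_col]:
            state[key] = row  # one retained value per column for this group
    return state

# Made-up sample rows, not taken from the real dataset.
rows = [
    {"ADDRESS1": "1 High St", "POSTCODE": "EH1 1AA", "INSPECTION_DATE": "2019-05-01", "RATING": "D"},
    {"ADDRESS1": "1 High St", "POSTCODE": "EH1 1AA", "INSPECTION_DATE": "2023-02-10", "RATING": "C"},
    {"ADDRESS1": "2 High St", "POSTCODE": "EH1 1AA", "INSPECTION_DATE": "2021-07-15", "RATING": "E"},
]
state = distinct_on_latest(rows, ["ADDRESS1", "POSTCODE"], "INSPECTION_DATE")

# Under this model, retained state grows with (number of groups) x (number of columns).
retained_cells = sum(len(row) for row in state.values())
print(len(state), retained_cells)
```

Under this model, a table with 106 columns and mostly unique addresses would retain close to one value per input cell. Whether that accounts for the observed peak is something profiling would need to confirm.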

### To Reproduce

See the parquet file and SQL query in the description above.

### Expected behavior

In an ideal world, the memory usage for this query would respect the memory pool limit (or only use "small" allocations as described in [the docs](https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-design)).

### Additional context

I'm happy to help diagnose this further (and potentially fix it) with some advice on how to profile the memory use or narrow down the cause. For now I just wanted to capture the issue to see if it's known, as I imagine it won't be an easy fix 😄

I know there are a few issues related to memory management floating around at the moment, but none that I could see directly mentioned `DISTINCT ON`, so apologies if this is a duplicate.
