Skip to content

Disproportionate memory use for DISTINCT ON query #17169

@connec

Description

@connec

Describe the bug

We have a parquet file (built from public data) with 106 columns and 1M rows which is 131.14 MiB in size (compressed, 913.89 MiB uncompressed).

When running a DISTINCT ON query using the unbounded memory pool, memory use climbs to over 160 GiB for this query:

SELECT DISTINCT
  ON ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE") *
FROM
  table
ORDER BY
  "ADDRESS1",
  "ADDRESS2",
  "ADDRESS3",
  "POSTCODE",
  "INSPECTION_DATE" DESC

When using a fair spill pool with 10 GiB, memory usage reaches "only" 30 GiB.

These results were observed on my local machine (MacBook Pro). On a production machine with the same 10 GiB limit we have seen a graceful allocation failure:

Resources exhausted: Failed to allocate additional 55.0 MB for GroupedHashAggregateStream[3]

This makes me think it could be the same underlying issue as #13831, exacerbated by the many columns.

To Reproduce

See the parquet file and SQL query in the description above.

Expected behavior

In an ideal world, the memory usage for this query would respect the memory pool limit (or only use "small" allocations as described in the docs).

Additional context

I'm happy to help diagnose this further (and potentially fix) with some advice on how to profile the memory use or narrow down the cause. For now I just wanted to capture the issue to see if it's known as I imagine it won't be an easy fix 😄

I know there are a few issues related to memory management floating around atm but none that I could see directly mentioned DISTINCT ON, so apologies if this is a duplicate.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions