Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set)

### Describe the bug

I discovered a significant performance regression in the DataFusion 50 release. The essence of it is that filtering/joining based on array membership is no longer performant in the DF 50 series. This is a relative regression from DF 49, where this was reasonably performant.

NOTE: I do realize that this is not a particularly great design, and that we can't effectively filter on arrays (as far as I know?) like this using pushdown techniques or probably even row group metadata. Unfortunately I'm just a consumer of this data set, and this is the only way I found to do the job. Suggestions to improve the query are of course also welcome :)

### To Reproduce

Here is SQL to reproduce in the `datafusion-cli`.

First, let's set up two external tables.

```sql
CREATE EXTERNAL TABLE categories_raw STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/categories/parquet/';

CREATE EXTERNAL TABLE places STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/places/parquet/';
```

Then, let's try doing a simple count of the places matching some subset of the categories (notes on the CTE in the additional context below).

Example query 1 (this is roughly our original query):

```sql
WITH categories_arr AS (
    SELECT array_agg(category_id) AS category_ids FROM categories_raw LIMIT 500
)
SELECT COUNT(*)
    FROM places p
    WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days' AND array_has_any(p.fsq_category_ids, (SELECT category_ids FROM categories_arr));
```

Example query 2:

```sql
WITH allowed_categories AS (
    select * from categories_raw LIMIT 500
)
SELECT count(*)
FROM places p
JOIN allowed_categories ac ON array_has(p.fsq_category_ids, ac.category_id)
WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days';
```

### Expected behavior

TL;DR, I _expect_ the queries to execute fairly quickly (**low single digit number of minutes**, with some reasonable variance for internet... I'm literally on the other side of the world from us-east-1 with a reasonably fast connection).

I also expect the query execution time, at least for this particular case, to not significantly vary based on the number of categories filtered.

Finally, I'd expect the CPU usage to either go down from current levels like DataFusion 49, OR if it's going to leverage more cores, to actually go faster ;) (DataFusion 49 rarely used more than a single core computing the count as described in Query 1, and did it much faster.)

Let me explain what I observed both queries doing under different DataFusion releases...

## Query 1

**Query 1 takes approximately 2 and a half minutes for me on DataFusion 49.0.2.** This is the baseline and seems reasonable. **Increasing the LIMIT to 1000 (more realistic), the execution time remains constant.** CPU usage is moderate; it does not need to saturate all cores.

Under DataFusion 50, this just immediately consumes 100% of all CPU (not even bouncing; straight up redline haha), and doesn't seem to do much in the way of data transfer. I killed the query after around 20 minutes, since it was clear that it wasn't going to be able to complete in a reasonable time without any data transfer.

This behavior in DataFusion 50 occurs regardless of what the LIMIT value is. Even a `LIMIT 10` sends all CPUs spinning, and no significant data transfer ever occurs.

I don't know how relevant this tidbit is, but unlike most flavors of high CPU utilization, this doesn't make my MacBook hot to the touch or even trigger fans. System load averages are around 100, but it's clearly not making any progress 🤷  No unusual memory activity either.

## Query 2

I discovered Query 2 when looking for a workaround for DataFusion 50. It expresses a similar intent, and still uses an array operation (`array_has` in this case), but it's expressed more explicitly as a join condition rather than a weird subquery. However, it has _wildly_ different performance characteristics.

On DataFusion 49, this **consumes a fair bit of CPU, than Query 1, and it took around 22 minutes**! That's like 10x longer than the original query.

DataFusion 50 fares a bit better better. It uses _even more CPU_ than 49 does initially, and seems to start pulling from the network sooner/faster. **It completes this query in just under 8 minutes.**

So that's an improvement, right? Well... **if you try doubling the `LIMIT` to 1000 (more representative of my typical query), the run time (DataFusion 50) approximately doubles to 15 minutes!** This is a bit strange. But at least it finishes 😅 

## Reduced throughput streaming results

What I've included above is just the count which we do beforehand by calling `.count()` with a more useful select query. Once we start actually streaming the results, throughput is unsurprisingly slower than before. I'm seeing somewhere between 10% and (absolute best case 50%) of the former throughput (in records/second) through our pipeline.

Unfortunately the pipeline has too many downstream variables for me to cleanly disentangle that into an MRE without a lot of effort, but I guess I shouldn't be too surprised by this.

### Additional context

## Versions and test methods

I initially observed the regression on several crates.io releases as a dependency in our internal application (release build). I then reproduced using the respective datafusion-cli versions to make things easier to write up.

For DataFusion 49, I checked out the latest commit on `branch-49`, which is tagged as 49.0.2 and has hash f43df3f2ae3aafb347996c58e852cc378807095b. I then built datafusion-cli in release mode.

For DataFusion 50, I used CLI version 50.2.0 via homebrew.

All observations are on an M1 Max MacBook Pro running macOS Sequoia 15.6.1. 

I have _not_ enabled any specific session context changes (code) or set any explicit options in the CLI.

## Notes on category filter

The "real world" version of this query uses a computed VIEW with a long WHERE clause. A size-tunable simple CTE is simpler and better for the purpose of illustration though.

The "real" category filter includes about 1100 categories, and is a significant portion of the raw category list (1245 entries total). I have tested multiple variations of this, and there is (as one would expect) no difference between referencing the view and any random sample of 1100 categories.

## Size of the array seems to matter?

`array_has_any` with a small cardinality list seems to be fine. For example, you can add `(unresolved_flags IS NULL OR NOT array_has_any(unresolved_flags, ['inappropriate', 'duplicate']))` to the WHERE clause, it's fine. This actually may seem to contradict what I wrote above about `LIMIT 10` not working on Query 1. Maybe there's something with static arrays vs computed ones? I have not had a chance to explore this yet.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set) #18070

Describe the bug

To Reproduce

Expected behavior

Query 1

Query 2

Reduced throughput streaming results

Additional context

Versions and test methods

Notes on category filter

Size of the array seems to matter?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set) #18070

Description

Describe the bug

To Reproduce

Expected behavior

Query 1

Query 2

Reduced throughput streaming results

Additional context

Versions and test methods

Notes on category filter

Size of the array seems to matter?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions