-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Describe the bug
I discovered a significant performance regression in the DataFusion 50 release. The essence of it is that filtering/joining based on array membership is no longer performant in the DF 50 series. This is a relative regression from DF 49, where this was reasonably performant.
NOTE: I do realize that this is not a particularly great design, and that we can't effectively filter on arrays (as far as I know?) like this using pushdown techniques or probably even row group metadata. Unfortunately I'm just a consumer of this data set, and this is the only way I found to do the job. Suggestions to improve the query are of course also welcome :)
To Reproduce
Here is SQL to reproduce in the datafusion-cli
.
First, let's set up two external tables.
CREATE EXTERNAL TABLE categories_raw STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/categories/parquet/';
CREATE EXTERNAL TABLE places STORED AS PARQUET LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/places/parquet/';
Then, let's try doing a simple count of the places matching some subset of the categories (notes on the CTE in the additional context below).
Example query 1 (this is roughly our original query):
WITH categories_arr AS (
SELECT array_agg(category_id) AS category_ids FROM categories_raw LIMIT 500
)
SELECT COUNT(*)
FROM places p
WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days' AND array_has_any(p.fsq_category_ids, (SELECT category_ids FROM categories_arr));
Example query 2:
WITH allowed_categories AS (
select * from categories_raw LIMIT 500
)
SELECT count(*)
FROM places p
JOIN allowed_categories ac ON array_has(p.fsq_category_ids, ac.category_id)
WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days';
Expected behavior
TL;DR, I expect the queries to execute fairly quickly (low single digit number of minutes, with some reasonable variance for internet... I'm literally on the other side of the world from us-east-1 with a reasonably fast connection).
I also expect the query execution time, at least for this particular case, to not significantly vary based on the number of categories filtered.
Finally, I'd expect the CPU usage to either go down from current levels like DataFusion 49, OR if it's going to leverage more cores, to actually go faster ;) (DataFusion 49 rarely used more than a single core computing the count as described in Query 1, and did it much faster.)
Let me explain what I observed both queries doing under different DataFusion releases...
Query 1
Query 1 takes approximately 2 and a half minutes for me on DataFusion 49.0.2. This is the baseline and seems reasonable. Increasing the LIMIT to 1000 (more realistic), the execution time remains constant. CPU usage is moderate; it does not need to saturate all cores.
Under DataFusion 50, this just immediately consumes 100% of all CPU (not even bouncing; straight up redline haha), and doesn't seem to do much in the way of data transfer. I killed the query after around 20 minutes, since it was clear that it wasn't going to be able to complete in a reasonable time without any data transfer.
This behavior in DataFusion 50 occurs regardless of what the LIMIT value is. Even a LIMIT 10
sends all CPUs spinning, and no significant data transfer ever occurs.
I don't know how relevant this tidbit is, but unlike most flavors of high CPU utilization, this doesn't make my MacBook hot to the touch or even trigger fans. System load averages are around 100, but it's clearly not making any progress 🤷 No unusual memory activity either.
Query 2
I discovered Query 2 when looking for a workaround for DataFusion 50. It expresses a similar intent, and still uses an array operation (array_has
in this case), but it's expressed more explicitly as a join condition rather than a weird subquery. However, it has wildly different performance characteristics.
On DataFusion 49, this consumes a fair bit of CPU, than Query 1, and it took around 22 minutes! That's like 10x longer than the original query.
DataFusion 50 fares a bit better better. It uses even more CPU than 49 does initially, and seems to start pulling from the network sooner/faster. It completes this query in just under 8 minutes.
So that's an improvement, right? Well... if you try doubling the LIMIT
to 1000 (more representative of my typical query), the run time (DataFusion 50) approximately doubles to 15 minutes! This is a bit strange. But at least it finishes 😅
Reduced throughput streaming results
What I've included above is just the count which we do beforehand by calling .count()
with a more useful select query. Once we start actually streaming the results, throughput is unsurprisingly slower than before. I'm seeing somewhere between 10% and (absolute best case 50%) of the former throughput (in records/second) through our pipeline.
Unfortunately the pipeline has too many downstream variables for me to cleanly disentangle that into an MRE without a lot of effort, but I guess I shouldn't be too surprised by this.
Additional context
Versions and test methods
I initially observed the regression on several crates.io releases as a dependency in our internal application (release build). I then reproduced using the respective datafusion-cli versions to make things easier to write up.
For DataFusion 49, I checked out the latest commit on branch-49
, which is tagged as 49.0.2 and has hash f43df3f. I then built datafusion-cli in release mode.
For DataFusion 50, I used CLI version 50.2.0 via homebrew.
All observations are on an M1 Max MacBook Pro running macOS Sequoia 15.6.1.
I have not enabled any specific session context changes (code) or set any explicit options in the CLI.
Notes on category filter
The "real world" version of this query uses a computed VIEW with a long WHERE clause. A size-tunable simple CTE is simpler and better for the purpose of illustration though.
The "real" category filter includes about 1100 categories, and is a significant portion of the raw category list (1245 entries total). I have tested multiple variations of this, and there is (as one would expect) no difference between referencing the view and any random sample of 1100 categories.
Size of the array seems to matter?
array_has_any
with a small cardinality list seems to be fine. For example, you can add (unresolved_flags IS NULL OR NOT array_has_any(unresolved_flags, ['inappropriate', 'duplicate']))
to the WHERE clause, it's fine. This actually may seem to contradict what I wrote above about LIMIT 10
not working on Query 1. Maybe there's something with static arrays vs computed ones? I have not had a chance to explore this yet.