jonathanc-n
Contributor

@jonathanc-n jonathanc-n commented Sep 9, 2025

Which issue does this PR close?

Rationale for this change

Adds the regular joins (left, right, full, inner) for PWMJ (piecewise merge join), since they follow a different code path from the existence joins.

What changes are included in this PR?

Adds classic join + physical planner

Are these changes tested?

Yes, with SLT tests and unit tests.

Follow up work to this pull request

  • Handling partitioned queries and multiple record batches (fuzz testing will be handled with this)
  • Simplify physical planning
  • Add more unit tests for different types (in another PR, as the LOC in this PR is getting a little daunting)

Next would be to implement the existence joins.

@github-actions github-actions bot added the core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and physical-plan (Changes to the physical-plan crate) labels Sep 9, 2025
@jonathanc-n jonathanc-n marked this pull request as draft September 9, 2025 04:03
@jonathanc-n
Contributor Author

@2010YOUY01 Would you like to check whether this is how you wanted to split up the work? I just wanted to put this out today; I'll clean it up properly this week. Only one external test is failing currently.

if join_filter.is_none() && matches!(join_type, JoinType::Inner) {
// cross join if there is no join conditions and no join filter set
Arc::new(CrossJoinExec::new(physical_left, physical_right))
} else if num_range_filters == 1
Contributor Author

I would like to refactor this in a separate pull request. It is purely a refactor and should be quite simple to do; I just wanted to get this version in first.

statement ok
set datafusion.execution.batch_size = 8192;

# TODO: partitioned PWMJ execution
Contributor Author

Partitioned execution is currently not allowed, because it would make reviewing the tests a little messy: many of the partitioned single-range queries would switch to PWMJ. This is another follow-up, tracked in #17427.

@jonathanc-n jonathanc-n marked this pull request as ready for review September 9, 2025 17:59
@jonathanc-n
Contributor Author

cc @2010YOUY01 @comphead this pr is now ready!

@jonathanc-n jonathanc-n changed the title from "POC: ClassicJoin for PWMJ" to "feat: ClassicJoin for PWMJ" Sep 9, 2025
@2010YOUY01
Contributor

This is great! I have some suggestions for the planning part, and I'll review the execution part tomorrow.

Refactor the inequality-extraction logic

I suggest moving the inequality-extraction logic from physical_planner.rs into https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/extract_equijoin_predicate.rs

The reason is that similar code is better kept in one place than scattered across several. The ExtractEquijoinPredicate logical optimizer rule extracts equality join predicates like t1.v1 = t2.v1; here we want to extract predicates like t1.v1 < t2.v1, so the logic should be very similar.

To do this, I think we need to extend the logical plan join node with an extra IE (inequality) predicate field. Maybe we can define a new struct for an IE predicate as (Expr, Op, Expr), which we could also use in other places.

/// Join two logical plans on one or more join columns
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Join {
    ...
    /// Equijoin clause expressed as pairs of (left, right) join expressions
    pub on: Vec<(Expr, Expr)>,
    /// Inequality clause expressed as (left, operator, right) join expressions           <-- HERE
    pub ie_predicates: Vec<(Expr, IEOp, Expr)>,
    /// Filters applied during join (non-equi conditions)
    pub filter: Option<Expr>,
    ...
}
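
As a rough illustration of the extraction step proposed here, the sketch below splits an already-flattened list of conjuncts into IE predicates and everything else. All types (`Side`, `Op`, `Column`, `Comparison`) and the function `extract_ie_predicates` are hypothetical stand-ins, not DataFusion's `Expr` or the existing optimizer rule; it also assumes each conjunct is a plain column-to-column comparison.

```rust
// Hypothetical, simplified stand-ins; not DataFusion types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Side { Left, Right }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op { Eq, Lt, LtEq, Gt, GtEq }

impl Op {
    fn is_range(self) -> bool {
        !matches!(self, Op::Eq)
    }
    /// Operator to use after swapping the two operands.
    fn swapped(self) -> Op {
        match self {
            Op::Lt => Op::Gt,
            Op::LtEq => Op::GtEq,
            Op::Gt => Op::Lt,
            Op::GtEq => Op::LtEq,
            Op::Eq => Op::Eq,
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Column { side: Side, name: &'static str }

#[derive(Debug, Clone, PartialEq, Eq)]
struct Comparison { lhs: Column, op: Op, rhs: Column }

fn col(side: Side, name: &'static str) -> Column {
    Column { side, name }
}

/// Returns (ie_predicates, remaining_conjuncts).
/// A conjunct is an IE predicate when it is a range comparison between one
/// column from each input; it is normalized so the left input's column comes first.
fn extract_ie_predicates(conjuncts: Vec<Comparison>) -> (Vec<Comparison>, Vec<Comparison>) {
    let mut ie = Vec::new();
    let mut rest = Vec::new();
    for c in conjuncts {
        if c.op.is_range() && c.lhs.side != c.rhs.side {
            if c.lhs.side == Side::Left {
                ie.push(c);
            } else {
                // `t2.v > t1.v` becomes `t1.v < t2.v`.
                ie.push(Comparison { lhs: c.rhs, op: c.op.swapped(), rhs: c.lhs });
            }
        } else {
            rest.push(c);
        }
    }
    (ie, rest)
}
```

In this sketch, equality conjuncts and comparisons between two columns of the same input stay in the remaining filter.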

To keep this compatible with systems that use only the LogicalPlan API and not the physical plans, we can also provide a utility that moves the IE predicates back into the filter:

Before: 
ie_predicates: [t1.v1 < t2.v1, t1.v2 < t2.v2]
filter: (t1.v3 + t2.v3) = 100

After:
ie_predicates: []
filter: ((t1.v3 + t2.v3) = 100) AND (t1.v1 < t2.v1) AND (t1.v2 < t2.v2)

Perhaps we can open a PR just for this IE-predicate extraction task, and during initial planning simply move the IE predicates back into the filter with the utility mentioned above.
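
The move-back utility could look roughly like the sketch below. It uses a toy expression type rather than DataFusion's `Expr`; the names `IEOp`, `Expr::Opaque`, `ie_predicates_into_filter`, and `render` are assumptions made for illustration only. It reproduces the Before/After example above.

```rust
// Hypothetical, simplified stand-ins; not DataFusion types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum IEOp { Lt, LtEq, Gt, GtEq }

impl IEOp {
    fn symbol(self) -> &'static str {
        match self {
            IEOp::Lt => "<",
            IEOp::LtEq => "<=",
            IEOp::Gt => ">",
            IEOp::GtEq => ">=",
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    /// Any expression this sketch does not need to look inside.
    Opaque(String),
    Compare(Box<Expr>, IEOp, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Folds every IE predicate into the filter with AND, leaving no IE predicates behind.
fn ie_predicates_into_filter(
    ie_predicates: Vec<(Expr, IEOp, Expr)>,
    filter: Option<Expr>,
) -> Option<Expr> {
    ie_predicates
        .into_iter()
        .map(|(l, op, r)| Expr::Compare(Box::new(l), op, Box::new(r)))
        .fold(filter, |acc, pred| match acc {
            Some(f) => Some(Expr::And(Box::new(f), Box::new(pred))),
            None => Some(pred),
        })
}

/// Renders an expression as text, for inspection only.
fn render(e: &Expr) -> String {
    match e {
        Expr::Column(c) => c.clone(),
        Expr::Opaque(s) => format!("({})", s),
        Expr::Compare(l, op, r) => format!("({} {} {})", render(l), op.symbol(), render(r)),
        Expr::And(l, r) => format!("{} AND {}", render(l), render(r)),
    }
}
```

With an empty predicate list and no filter the function returns `None`, so a join without any condition is left unchanged.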

Make it configurable to turn on/off PWMJ

I'll try to finish #17467 soon to make it easier, so let's put this on hold for now.

@comphead
Contributor

Thanks @jonathanc-n and @2010YOUY01

#17467 would definitely be nice to have, as PWMJ can start as an optional, experimental join that is documented separately, showing its benefits and limitations to the end user. The same happened with SMJ, which was an experimental feature for quite some time.

Another good way to identify performance bottlenecks is to absorb some knowledge from #17488, which should also help keep the join stable.

As an optional feature it is pretty safe to merge. Again referring to SMJ, there was a separate ticket, like #9846, with post-launch checks to make sure it is safe to use.

Let me know your thoughts.

@jonathanc-n
Contributor Author

jonathanc-n commented Sep 11, 2025

Yes I think the experimental flag should be added first and we can do the equality extraction logic as a follow up. WDYT @2010YOUY01 Do you think you want to get #17467 before this one?

@2010YOUY01
Contributor

> Yes I think the experimental flag should be added first and we can do the equality extraction logic as a follow up. WDYT @2010YOUY01 Do you think you want to get #17467 before this one?

Yes, so let's do the other work first. If I can't get #17467 done by the time this PR is ready, let's add an enable_piecewise_merge_join option here. I think we can agree on this configuration.

Contributor

@2010YOUY01 2010YOUY01 left a comment

I have gone over the exec.rs, and will continue with the stream implementation part soon.

ExecutionPlan, PlanProperties,
};
use crate::{DisplayAs, DisplayFormatType, ExecutionPlanProperties};

Contributor

This is one of the best module comments I have seen.

@github-actions github-actions bot added the common (Related to common crate) label Sep 14, 2025
@jonathanc-n
Contributor Author

@2010YOUY01 I have added the requested changes! Should be good for another go.

@jonathanc-n
Contributor Author

@comphead Should a flag be added to make this optional, like allow_pwmj_execution or something along those lines?

@github-actions github-actions bot removed the common (Related to common crate) label Sep 15, 2025
Contributor

@2010YOUY01 2010YOUY01 left a comment

I took a quick look through classic_join.rs, and the general structure looks great. I have flagged some major issues that I'd like to tackle first.

The goal now is to ensure it's significantly faster than NLJ. I ran some micro-benchmarks and found that it's slower, so I'd like to better understand the implementation and make it faster.

> set datafusion.execution.target_partitions = 1;
0 row(s) fetched.
Elapsed 0.001 seconds.
> SELECT *
        FROM range(30000) AS t1
        INNER JOIN range(30000) AS t2
        ON (t1.value > t2.value);
...
885262824 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.840 seconds.
> SELECT *
        FROM range(30000) AS t1
        FULL JOIN range(30000) AS t2
        ON (t1.value > t2.value);
...

885262825 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 1.592 seconds.

They're Q11 and Q12 from https://github.com/apache/datafusion/blob/main/benchmarks/src/nlj.rs
With NLJ they both take around 0.55s. The results also don't match.
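
As a sanity check on the mismatch, the expected row counts can be computed independently of either join operator. The sketch below assumes that range(n) yields the integers 0 to n-1; the function names are illustrative. Under that assumption the inner join should return 449,985,000 rows and the full join 449,985,002, against the 885,262,824 and 885,262,825 printed above.

```rust
/// Inner-join row count for `t1.value > t2.value`, where t1 = 0..n and t2 = 0..m.
fn inner_gt_count(n: u64, m: u64) -> u64 {
    // For a fixed b, the matching a values are b+1, ..., n-1.
    (0..m).map(|b| n.saturating_sub(b + 1)).sum()
}

/// Full-join row count: matched pairs plus one row per unmatched input row.
fn full_gt_count(n: u64, m: u64) -> u64 {
    // a has a partner only if some b < a exists, i.e. m > 0 and a > 0.
    let unmatched_left = (0..n).filter(|&a| m == 0 || a == 0).count() as u64;
    // b has a partner only if some a > b exists, i.e. b + 1 < n.
    let unmatched_right = (0..m).filter(|&b| b + 1 >= n).count() as u64;
    inner_gt_count(n, m) + unmatched_left + unmatched_right
}
```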


The remaining part for me to review:

  • minor issues in classic_join.rs
  • test coverage


// For Left, Right, Full, and Inner joins, incoming stream batches will already be sorted.
#[allow(clippy::too_many_arguments)]
fn resolve_classic_join(
Contributor

I found the implementation of this function quite hard to understand. Is it possible to structure it this way:

// Materialize the result when possible
if batch_process_state.has_ready_batch() {
    return Ok(batch_process_state.finish());
}
// Else advancing the stream/buffer side index, and put the matched indices into `batch_process_state` for it to materialize incrementally later
// ...
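
A self-contained sketch of that control flow is shown below. `BatchProcessState` and `next_batch` are hypothetical, the inputs are plain integer slices rather than record batches, and the inner scan is a placeholder nested loop (matching `buffered < stream`) rather than the merge logic in this PR. The point is only the shape: emit a batch as soon as one is ready, otherwise advance the stream side and record matched index pairs.

```rust
/// Hypothetical stand-in for the `batch_process_state` named above.
struct BatchProcessState {
    /// Matched (buffered_index, stream_index) pairs not yet emitted.
    pending: Vec<(usize, usize)>,
    /// Target number of pairs per output batch.
    batch_size: usize,
}

impl BatchProcessState {
    fn new(batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { pending: Vec::new(), batch_size }
    }

    fn has_ready_batch(&self) -> bool {
        self.pending.len() >= self.batch_size
    }

    /// Removes and returns up to `batch_size` pending pairs.
    fn finish(&mut self) -> Vec<(usize, usize)> {
        let n = self.batch_size.min(self.pending.len());
        self.pending.drain(..n).collect()
    }
}

/// Returns the next output batch, or `None` once all input is consumed.
fn next_batch(
    state: &mut BatchProcessState,
    buffered: &[i64],
    stream: &[i64],
    stream_idx: &mut usize,
) -> Option<Vec<(usize, usize)>> {
    loop {
        // Materialize the result when possible.
        if state.has_ready_batch() {
            return Some(state.finish());
        }
        // Stream exhausted: flush the remainder, then signal completion.
        if *stream_idx == stream.len() {
            return if state.pending.is_empty() { None } else { Some(state.finish()) };
        }
        // Otherwise advance the stream side and record the matched indices.
        let s = *stream_idx;
        *stream_idx += 1;
        for (b, &value) in buffered.iter().enumerate() {
            if value < stream[s] {
                state.pending.push((b, s));
            }
        }
    }
}
```

Calling `next_batch` repeatedly until it returns `None` yields every matched pair exactly once, in batches of at most `batch_size`.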

@jonathanc-n
Contributor Author

jonathanc-n commented Sep 16, 2025

I'll try to complete all the refactoring tomorrow. The performance may be due to which sides are being used; I will need to look into that.

The results don't match because it currently doesn't allow execution of more than one record batch.

@jonathanc-n
Contributor Author

The performance saw a similar hit in #16660 (the benchmark is in the description). I think I can tune when to use this join based on the incoming size in a follow-up. For now, the config will restrict this join to keep it purely experimental.

@jonathanc-n
Contributor Author

How did you get the incorrect result? I'm testing query 12 and it doesn't optimize into a piecewise merge join.

@2010YOUY01
Contributor

> The performance saw a similar hit in #16660 (the benchmark is in the description). I think I can tune when to use this join based on the incoming size in a follow up, for now the config will restrain this join to keep it purely experimental

That bench doesn't include sort time. PWMJ should be faster than NLJ even if it includes the sorting overhead (n*log(n) vs. n^2).
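
To make the complexity argument concrete, here is a minimal sketch that counts the matches of `l < r` two ways: with nested loops, and by sorting both sides and advancing a single pointer. It is not the operator in this PR. It only counts pairs, sorts ascending for simplicity, and all function names are illustrative.

```rust
/// Counts pairs with `l < r` by comparing every pair: O(n * m).
fn nested_loop_count(left: &[i64], right: &[i64]) -> usize {
    left.iter()
        .map(|&l| right.iter().filter(|&&r| l < r).count())
        .sum()
}

/// Same count after sorting both sides: O(n log n + m log m) for the sorts,
/// then a single O(n + m) pass.
fn piecewise_merge_count(left: &[i64], right: &[i64]) -> usize {
    let mut l_sorted = left.to_vec();
    let mut r_sorted = right.to_vec();
    l_sorted.sort_unstable();
    r_sorted.sort_unstable();

    // Index of the first right value strictly greater than the current left
    // value. It only moves forward because the left values are ascending.
    let mut first_match = 0;
    let mut total = 0;
    for &l in &l_sorted {
        while first_match < r_sorted.len() && r_sorted[first_match] <= l {
            first_match += 1;
        }
        // Every right value from `first_match` onward satisfies `l < r`.
        total += r_sorted.len() - first_match;
    }
    total
}
```

Producing the joined rows is still proportional to the output size in both approaches; the saving is in the number of comparisons needed to find the matching range for each row.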

I think the main motivation for adding this executor is its performance advantage, so we probably shouldn’t merge an initial PR without first getting it to a good performance level. (Also, since there aren’t many merge conflicts to resolve for this PR, I don’t think there’s any rush.)

I can help diagnose it later.

> How did you get the incorrect result? I'm testing query 12 and it doesnt optimize into a piecewisemergejoin

Run set datafusion.execution.target_partitions = 1; first. This config should trigger PWMJ.

Moreover, we should get the sqlite extended tests passing later, either by configuring target_partitions to 1 or by enabling parallel execution for it. (BTW, why don't we support larger target_partitions now? It doesn't seem to require a lot of change; only the stream side has to be round-robin repartitioned.)

@github-actions github-actions bot added the common (Related to common crate) label Oct 1, 2025
@jonathanc-n
Contributor Author

@2010YOUY01 I have made quite a few changes:

New Changes

  • Incremental output logic now just uses BatchCoalescer which makes all the logic a lot simpler
  • Uses slice() instead of take(), which can improve performance
  • Added RoundRobin partitioning for the right inputs; the left inputs are single-partitioned
  • Added remaining_partitions to synchronize when to actually check unmatched indices. This enforces that building unmatched indices happens only once.
  • Added some additional tests
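
The remaining_partitions idea from the list above can be sketched with an atomic countdown: every stream partition marks the buffered rows it matched, and only the last partition to finish builds the unmatched indices. The types and functions below (`SharedBuffered`, `finish_partition`, `run_partitions`) are hypothetical and use plain threads; they are not the structures in this PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical state shared by all stream partitions.
struct SharedBuffered {
    /// One flag per buffered-side row, set when any partition matches it.
    matched: Mutex<Vec<bool>>,
    /// Number of stream partitions that have not finished yet.
    remaining_partitions: AtomicUsize,
}

impl SharedBuffered {
    fn new(rows: usize, partitions: usize) -> Self {
        Self {
            matched: Mutex::new(vec![false; rows]),
            remaining_partitions: AtomicUsize::new(partitions),
        }
    }

    fn mark_matched(&self, idx: usize) {
        self.matched.lock().unwrap()[idx] = true;
    }

    /// Called once by each partition when its stream is exhausted.
    /// Only the last caller receives the unmatched buffered indices.
    fn finish_partition(&self) -> Option<Vec<usize>> {
        if self.remaining_partitions.fetch_sub(1, Ordering::AcqRel) == 1 {
            let matched = self.matched.lock().unwrap();
            Some(
                matched
                    .iter()
                    .enumerate()
                    .filter(|(_, m)| !**m)
                    .map(|(i, _)| i)
                    .collect(),
            )
        } else {
            None
        }
    }
}

/// Runs one thread per partition and collects every `Some` result.
fn run_partitions(rows: usize, matches_per_partition: Vec<Vec<usize>>) -> Vec<Vec<usize>> {
    let shared = Arc::new(SharedBuffered::new(rows, matches_per_partition.len()));
    let handles: Vec<_> = matches_per_partition
        .into_iter()
        .map(|matches| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for idx in matches {
                    shared.mark_matched(idx);
                }
                shared.finish_partition()
            })
        })
        .collect();
    handles
        .into_iter()
        .filter_map(|h| h.join().unwrap())
        .collect()
}
```

Whichever thread finishes last, exactly one partition returns the unmatched set, so the unmatched rows are emitted once.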

Current benchmarks use the larger table on the left side and the smaller one on the right. Swapping inputs will be supported in a follow-up.

Benchmarks

Benchmarks look very good. To run it yourself you can replace the queries in nlj.rs with the queries below and make sure to remove the NLJ only restriction.

These queries use < or <=, which causes a descending sort, so these benchmarks account for sorting.

Full Joins

Queries

r#"
        SELECT *
        FROM range(30000) AS t1
        FULL JOIN range(100)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        FULL JOIN range(1000) AS t2
        ON (t1.value <= t2.value);
    "#,

    r#"
        SELECT *
        FROM range(30000) AS t1
        FULL JOIN range(8000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(100000) AS t1
        FULL JOIN range(1000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(200000) AS t1
        FULL JOIN range(8000)   AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(500000) AS t1
        FULL JOIN range(2000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(50000) AS t1
        FULL JOIN range(5000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(80000) AS t1
        FULL JOIN range(7000)  AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(7000) AS t1
        FULL JOIN range(1000)  AS t2
        ON (t1.value <= t2.value);
    "#,

    r#"
        set datafusion.optimizer.allow_piecewise_merge_join=true
    "#,

    r#"
        SELECT *
        FROM range(30000) AS t1
        FULL JOIN range(100)  AS t2
        ON (t1.value < t2.value);
    "#,

    r#"
        SELECT *
        FROM range(30000) AS t1
        FULL JOIN range(1000) AS t2
        ON (t1.value <= t2.value);
    "#,

    r#"
        SELECT *
        FROM range(30000) AS t1
        FULL JOIN range(8000)  AS t2
        ON (t1.value < t2.value);
    "#,

    r#"
        SELECT *
        FROM range(100000) AS t1
        FULL JOIN range(1000)   AS t2
        ON (t1.value < t2.value);
    "#,

    r#"
        SELECT *
        FROM range(200000) AS t1
        FULL JOIN range(8000)   AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(500000) AS t1
        FULL JOIN range(2000)   AS t2
        ON (t1.value < t2.value);
    "#,

    r#"
        SELECT *
        FROM range(50000) AS t1
        FULL JOIN range(5000)  AS t2
        ON (t1.value < t2.value);
    "#,

    r#"
        SELECT *
        FROM range(80000) AS t1
        FULL JOIN range(7000)  AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(7000) AS t1
        FULL JOIN range(1000)  AS t2
        ON (t1.value <= t2.value);
    "#,

Results

Query 1 iteration 0 returned 34852 rows in 16.05125ms
Query 1 iteration 1 returned 34852 rows in 11.752291ms
Query 1 iteration 2 returned 34852 rows in 9.914917ms
Query 2 iteration 0 returned 529500 rows in 16.1965ms
Query 2 iteration 1 returned 529500 rows in 15.085ms
Query 2 iteration 2 returned 529500 rows in 14.240292ms
Query 3 iteration 0 returned 32018002 rows in 120.564584ms
Query 3 iteration 1 returned 32018002 rows in 120.938292ms
Query 3 iteration 2 returned 32018002 rows in 123.329541ms
Query 4 iteration 0 returned 598502 rows in 43.972708ms
Query 4 iteration 1 returned 598502 rows in 44.160041ms
Query 4 iteration 2 returned 598502 rows in 44.264125ms
Query 5 iteration 0 returned 32196000 rows in 495.291458ms
Query 5 iteration 1 returned 32196000 rows in 509.875792ms
Query 5 iteration 2 returned 32196000 rows in 508.242708ms
Query 6 iteration 0 returned 2497002 rows in 317.830542ms
Query 6 iteration 1 returned 2497002 rows in 324.181541ms
Query 6 iteration 2 returned 2497002 rows in 319.515667ms
Query 7 iteration 0 returned 12542502 rows in 99.123666ms
Query 7 iteration 1 returned 12542502 rows in 98.18625ms
Query 7 iteration 2 returned 12542502 rows in 98.21575ms
Query 8 iteration 0 returned 24576500 rows in 206.997417ms
Query 8 iteration 1 returned 24576500 rows in 205.0475ms
Query 8 iteration 2 returned 24576500 rows in 201.080625ms
Query 9 iteration 0 returned 506500 rows in 4.351875ms
Query 9 iteration 1 returned 506500 rows in 4.322667ms
Query 9 iteration 2 returned 506500 rows in 4.290833ms
using pwmj for 10
Query 10 iteration 0 returned 0 rows in 25.458µs
Query 10 iteration 1 returned 0 rows in 21.958µs
Query 10 iteration 2 returned 0 rows in 21.334µs
using pwmj for 11
Query 11 iteration 0 returned 34852 rows in 850.875µs
Query 11 iteration 1 returned 34852 rows in 616.75µs
Query 11 iteration 2 returned 34852 rows in 613.292µs
using pwmj for 12
Query 12 iteration 0 returned 529500 rows in 1.581291ms
Query 12 iteration 1 returned 529500 rows in 1.555667ms
Query 12 iteration 2 returned 529500 rows in 1.574417ms
using pwmj for 13
Query 13 iteration 0 returned 32018002 rows in 50.687208ms
Query 13 iteration 1 returned 32018002 rows in 51.185ms
Query 13 iteration 2 returned 32018002 rows in 52.096125ms
using pwmj for 14
Query 14 iteration 0 returned 598502 rows in 3.986583ms
Query 14 iteration 1 returned 598502 rows in 4.21625ms
Query 14 iteration 2 returned 598502 rows in 3.984708ms
using pwmj for 15
Query 15 iteration 0 returned 32196000 rows in 56.680583ms
Query 15 iteration 1 returned 32196000 rows in 56.67925ms
Query 15 iteration 2 returned 32196000 rows in 57.5255ms
using pwmj for 16
Query 16 iteration 0 returned 2497002 rows in 20.027667ms
Query 16 iteration 1 returned 2497002 rows in 19.987083ms
Query 16 iteration 2 returned 2497002 rows in 20.5755ms
using pwmj for 17
Query 17 iteration 0 returned 12542502 rows in 21.576958ms
Query 17 iteration 1 returned 12542502 rows in 21.268875ms
Query 17 iteration 2 returned 12542502 rows in 21.101625ms
using pwmj for 18
Query 18 iteration 0 returned 24576500 rows in 41.632166ms
Query 18 iteration 1 returned 24576500 rows in 42.199417ms
Query 18 iteration 2 returned 24576500 rows in 41.981208ms
using pwmj for 19
Query 19 iteration 0 returned 506500 rows in 1.459417ms
Query 19 iteration 1 returned 506500 rows in 1.31825ms
Query 19 iteration 2 returned 506500 rows in 1.320333ms

Left Joins

Queries

r#"
        SELECT *
        FROM range(30000) AS t1
        LEFT JOIN range(100)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        LEFT JOIN range(1000) AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        LEFT JOIN range(8000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(100000) AS t1
        LEFT JOIN range(1000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(200000) AS t1
        LEFT JOIN range(8000)   AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(500000) AS t1
        LEFT JOIN range(2000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(50000) AS t1
        LEFT JOIN range(5000)  AS t2
        ON (t1.value < t2.value);
    "#,

    r#"
        SELECT *
        FROM range(80000) AS t1
        LEFT JOIN range(7000)  AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        set datafusion.optimizer.allow_piecewise_merge_join=true
    "#,

    r#"
        SELECT *
        FROM range(30000) AS t1
        LEFT JOIN range(100)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        LEFT JOIN range(1000) AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        LEFT JOIN range(8000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(100000) AS t1
        LEFT JOIN range(1000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(200000) AS t1
        LEFT JOIN range(8000)   AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(500000) AS t1
        LEFT JOIN range(2000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(50000) AS t1
        LEFT JOIN range(5000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(80000) AS t1
        LEFT JOIN range(7000)  AS t2
        ON (t1.value <= t2.value);
    "#,
    

Results

Query 1 iteration 0 returned 34851 rows in 13.563958ms
Query 1 iteration 1 returned 34851 rows in 10.809917ms
Query 1 iteration 2 returned 34851 rows in 9.401708ms
Query 2 iteration 0 returned 529500 rows in 16.359917ms
Query 2 iteration 1 returned 529500 rows in 14.207666ms
Query 2 iteration 2 returned 529500 rows in 13.391625ms
Query 3 iteration 0 returned 32018001 rows in 119.216542ms
Query 3 iteration 1 returned 32018001 rows in 124.351333ms
Query 3 iteration 2 returned 32018001 rows in 125.460291ms
Query 4 iteration 0 returned 598501 rows in 43.042167ms
Query 4 iteration 1 returned 598501 rows in 41.469958ms
Query 4 iteration 2 returned 598501 rows in 42.400917ms
Query 5 iteration 0 returned 32196000 rows in 484.91175ms
Query 5 iteration 1 returned 32196000 rows in 501.124291ms
Query 5 iteration 2 returned 32196000 rows in 483.698208ms
Query 6 iteration 0 returned 2497001 rows in 314.219708ms
Query 6 iteration 1 returned 2497001 rows in 312.403458ms
Query 6 iteration 2 returned 2497001 rows in 316.319542ms
Query 7 iteration 0 returned 12542501 rows in 97.666ms
Query 7 iteration 1 returned 12542501 rows in 99.839541ms
Query 7 iteration 2 returned 12542501 rows in 97.520875ms
Query 8 iteration 0 returned 24576500 rows in 205.489833ms
Query 8 iteration 1 returned 24576500 rows in 199.618292ms
Query 8 iteration 2 returned 24576500 rows in 205.075834ms
using pwmj for 9
Query 9 iteration 0 returned 0 rows in 25.583µs
Query 9 iteration 1 returned 0 rows in 22.25µs
Query 9 iteration 2 returned 0 rows in 20.459µs
using pwmj for 10
Query 10 iteration 0 returned 34851 rows in 889.5µs
Query 10 iteration 1 returned 34851 rows in 644.875µs
Query 10 iteration 2 returned 34851 rows in 674.542µs
using pwmj for 11
Query 11 iteration 0 returned 529500 rows in 1.657375ms
Query 11 iteration 1 returned 529500 rows in 1.7705ms
Query 11 iteration 2 returned 529500 rows in 1.71175ms
using pwmj for 12
Query 12 iteration 0 returned 32018001 rows in 56.604542ms
Query 12 iteration 1 returned 32018001 rows in 54.31675ms
Query 12 iteration 2 returned 32018001 rows in 58.129333ms
using pwmj for 13
Query 13 iteration 0 returned 598501 rows in 4.146833ms
Query 13 iteration 1 returned 598501 rows in 4.068125ms
Query 13 iteration 2 returned 598501 rows in 4.014459ms
using pwmj for 14
Query 14 iteration 0 returned 32196000 rows in 62.61375ms
Query 14 iteration 1 returned 32196000 rows in 63.86275ms
Query 14 iteration 2 returned 32196000 rows in 63.588916ms
using pwmj for 15
Query 15 iteration 0 returned 2497001 rows in 20.669417ms
Query 15 iteration 1 returned 2497001 rows in 20.772416ms
Query 15 iteration 2 returned 2497001 rows in 20.07075ms
using pwmj for 16
Query 16 iteration 0 returned 12542501 rows in 23.259375ms
Query 16 iteration 1 returned 12542501 rows in 22.053ms
Query 16 iteration 2 returned 12542501 rows in 23.912958ms
using pwmj for 17
Query 17 iteration 0 returned 24576500 rows in 46.321708ms
Query 17 iteration 1 returned 24576500 rows in 44.81025ms
Query 17 iteration 2 returned 24576500 rows in 47.482666ms

Right Joins

Queries

r#"
        SELECT *
        FROM range(30000) AS t1
        RIGHT JOIN range(100)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        RIGHT JOIN range(1000) AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        RIGHT JOIN range(8000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(100000) AS t1
        RIGHT JOIN range(1000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(200000) AS t1
        RIGHT JOIN range(8000)   AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(500000) AS t1
        RIGHT JOIN range(2000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(50000) AS t1
        RIGHT JOIN range(5000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(80000) AS t1
        RIGHT JOIN range(7000)  AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        set datafusion.optimizer.allow_piecewise_merge_join=true
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        RIGHT JOIN range(100)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        RIGHT JOIN range(1000) AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(30000) AS t1
        RIGHT JOIN range(8000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(100000) AS t1
        RIGHT JOIN range(1000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(200000) AS t1
        RIGHT JOIN range(8000)   AS t2
        ON (t1.value <= t2.value);
    "#,
    r#"
        SELECT *
        FROM range(500000) AS t1
        RIGHT JOIN range(2000)   AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(50000) AS t1
        RIGHT JOIN range(5000)  AS t2
        ON (t1.value < t2.value);
    "#,
    r#"
        SELECT *
        FROM range(80000) AS t1
        RIGHT JOIN range(7000)  AS t2
        ON (t1.value <= t2.value);
    "#,

Results

Query 1 iteration 0 returned 4951 rows in 15.904458ms
Query 1 iteration 1 returned 4951 rows in 11.353667ms
Query 1 iteration 2 returned 4951 rows in 10.1235ms
Query 2 iteration 0 returned 500500 rows in 16.099917ms
Query 2 iteration 1 returned 500500 rows in 14.73375ms
Query 2 iteration 2 returned 500500 rows in 14.110875ms
Query 3 iteration 0 returned 31996001 rows in 123.070583ms
Query 3 iteration 1 returned 31996001 rows in 125.751833ms
Query 3 iteration 2 returned 31996001 rows in 122.032792ms
Query 4 iteration 0 returned 499501 rows in 45.082459ms
Query 4 iteration 1 returned 499501 rows in 44.246958ms
Query 4 iteration 2 returned 499501 rows in 44.218292ms
Query 5 iteration 0 returned 32004000 rows in 494.880083ms
Query 5 iteration 1 returned 32004000 rows in 497.626709ms
Query 5 iteration 2 returned 32004000 rows in 491.031125ms
Query 6 iteration 0 returned 1999001 rows in 319.75ms
Query 6 iteration 1 returned 1999001 rows in 327.817708ms
Query 6 iteration 2 returned 1999001 rows in 320.286208ms
Query 7 iteration 0 returned 12497501 rows in 100.778041ms
Query 7 iteration 1 returned 12497501 rows in 98.214708ms
Query 7 iteration 2 returned 12497501 rows in 98.079084ms
Query 8 iteration 0 returned 24503500 rows in 203.049042ms
Query 8 iteration 1 returned 24503500 rows in 204.667958ms
Query 8 iteration 2 returned 24503500 rows in 206.985375ms
using pwmj for 9
Query 9 iteration 0 returned 0 rows in 26.625µs
Query 9 iteration 1 returned 0 rows in 22.875µs
Query 9 iteration 2 returned 0 rows in 20.208µs
using pwmj for 10
Query 10 iteration 0 returned 4951 rows in 708µs
Query 10 iteration 1 returned 4951 rows in 554.792µs
Query 10 iteration 2 returned 4951 rows in 557.792µs
using pwmj for 11
Query 11 iteration 0 returned 500500 rows in 1.338708ms
Query 11 iteration 1 returned 500500 rows in 1.315292ms
Query 11 iteration 2 returned 500500 rows in 1.318208ms
using pwmj for 12
Query 12 iteration 0 returned 31996001 rows in 36.270542ms
Query 12 iteration 1 returned 31996001 rows in 35.548916ms
Query 12 iteration 2 returned 31996001 rows in 37.627291ms
using pwmj for 13
Query 13 iteration 0 returned 499501 rows in 3.288875ms
Query 13 iteration 1 returned 499501 rows in 3.284625ms
Query 13 iteration 2 returned 499501 rows in 3.224916ms
using pwmj for 14
Query 14 iteration 0 returned 32004000 rows in 42.140333ms
Query 14 iteration 1 returned 32004000 rows in 41.813167ms
Query 14 iteration 2 returned 32004000 rows in 41.961667ms
using pwmj for 15
Query 15 iteration 0 returned 1999001 rows in 16.863042ms
Query 15 iteration 1 returned 1999001 rows in 17.2575ms
Query 15 iteration 2 returned 1999001 rows in 16.82575ms
using pwmj for 16
Query 16 iteration 0 returned 12497501 rows in 16.231792ms
Query 16 iteration 1 returned 12497501 rows in 15.059792ms
Query 16 iteration 2 returned 12497501 rows in 15.160083ms
using pwmj for 17
Query 17 iteration 0 returned 24503500 rows in 30.725667ms
Query 17 iteration 1 returned 24503500 rows in 30.282084ms
Query 17 iteration 2 returned 24503500 rows in 30.001167ms

Inner Joins

Queries

r#"
    SELECT *
    FROM range(30000) AS t1
    INNER JOIN range(100)  AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(30000) AS t1
    INNER JOIN range(1000) AS t2
    ON (t1.value <= t2.value);
"#,
r#"
    SELECT *
    FROM range(30000) AS t1
    INNER JOIN range(8000)  AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(100000) AS t1
    INNER JOIN range(1000)   AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(200000) AS t1
    INNER JOIN range(8000)   AS t2
    ON (t1.value <= t2.value);
"#,
r#"
    SELECT *
    FROM range(500000) AS t1
    INNER JOIN range(2000)   AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(50000) AS t1
    INNER JOIN range(5000)  AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(80000) AS t1
    INNER JOIN range(7000)  AS t2
    ON (t1.value <= t2.value);
"#,
r#"
    set datafusion.optimizer.allow_piecewise_merge_join=true
"#,
r#"
    SELECT *
    FROM range(30000) AS t1
    INNER JOIN range(100)  AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(30000) AS t1
    INNER JOIN range(1000) AS t2
    ON (t1.value <= t2.value);
"#,
r#"
    SELECT *
    FROM range(30000) AS t1
    INNER JOIN range(8000)  AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(100000) AS t1
    INNER JOIN range(1000)   AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(200000) AS t1
    INNER JOIN range(8000)   AS t2
    ON (t1.value <= t2.value);
"#,
r#"
    SELECT *
    FROM range(500000) AS t1
    INNER JOIN range(2000)   AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(50000) AS t1
    INNER JOIN range(5000)  AS t2
    ON (t1.value < t2.value);
"#,
r#"
    SELECT *
    FROM range(80000) AS t1
    INNER JOIN range(7000)  AS t2
    ON (t1.value <= t2.value);
"#,

Results

Query 1 iteration 0 returned 4950 rows in 13.815208ms
Query 1 iteration 1 returned 4950 rows in 10.542709ms
Query 1 iteration 2 returned 4950 rows in 9.745042ms
Query 2 iteration 0 returned 500500 rows in 15.482083ms
Query 2 iteration 1 returned 500500 rows in 14.565542ms
Query 2 iteration 2 returned 500500 rows in 13.728666ms
Query 3 iteration 0 returned 31996000 rows in 126.463708ms
Query 3 iteration 1 returned 31996000 rows in 121.749166ms
Query 3 iteration 2 returned 31996000 rows in 120.915709ms
Query 4 iteration 0 returned 499500 rows in 42.01275ms
Query 4 iteration 1 returned 499500 rows in 42.730833ms
Query 4 iteration 2 returned 499500 rows in 101.3445ms
Query 5 iteration 0 returned 32004000 rows in 511.74725ms
Query 5 iteration 1 returned 32004000 rows in 510.05675ms
Query 5 iteration 2 returned 32004000 rows in 498.320875ms
Query 6 iteration 0 returned 1999000 rows in 315.585917ms
Query 6 iteration 1 returned 1999000 rows in 314.887791ms
Query 6 iteration 2 returned 1999000 rows in 313.276792ms
Query 7 iteration 0 returned 12497500 rows in 98.969ms
Query 7 iteration 1 returned 12497500 rows in 100.670583ms
Query 7 iteration 2 returned 12497500 rows in 98.817625ms
Query 8 iteration 0 returned 24503500 rows in 220.258166ms
Query 8 iteration 1 returned 24503500 rows in 205.282709ms
Query 8 iteration 2 returned 24503500 rows in 204.11275ms
using pwmj for 9
Query 9 iteration 0 returned 0 rows in 27.667µs
Query 9 iteration 1 returned 0 rows in 22.75µs
Query 9 iteration 2 returned 0 rows in 21.375µs
using pwmj for 10
Query 10 iteration 0 returned 4950 rows in 721.041µs
Query 10 iteration 1 returned 4950 rows in 533.333µs
Query 10 iteration 2 returned 4950 rows in 596.5µs
using pwmj for 11
Query 11 iteration 0 returned 500500 rows in 1.363375ms
Query 11 iteration 1 returned 500500 rows in 1.469292ms
Query 11 iteration 2 returned 500500 rows in 1.374417ms
using pwmj for 12
Query 12 iteration 0 returned 31996000 rows in 36.737875ms
Query 12 iteration 1 returned 31996000 rows in 37.10075ms
Query 12 iteration 2 returned 31996000 rows in 37.435667ms
using pwmj for 13
Query 13 iteration 0 returned 499500 rows in 3.59ms
Query 13 iteration 1 returned 499500 rows in 3.438667ms
Query 13 iteration 2 returned 499500 rows in 3.273291ms
using pwmj for 14
Query 14 iteration 0 returned 32004000 rows in 42.230333ms
Query 14 iteration 1 returned 32004000 rows in 42.319541ms
Query 14 iteration 2 returned 32004000 rows in 42.030083ms
using pwmj for 15
Query 15 iteration 0 returned 1999000 rows in 17.155917ms
Query 15 iteration 1 returned 1999000 rows in 17.361208ms
Query 15 iteration 2 returned 1999000 rows in 17.18875ms
using pwmj for 16
Query 16 iteration 0 returned 12497500 rows in 16.114084ms
Query 16 iteration 1 returned 12497500 rows in 16.151625ms
Query 16 iteration 2 returned 12497500 rows in 15.919125ms
using pwmj for 17
Query 17 iteration 0 returned 24503500 rows in 30.728791ms
Query 17 iteration 1 returned 24503500 rows in 30.893458ms
Query 17 iteration 2 returned 24503500 rows in 30.559375ms
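The row counts in this log can be checked independently of either join operator. Below is a minimal sketch of the sort-then-sweep idea behind a merge-style inequality join; it is not this PR's implementation, and `count_less_than` is a hypothetical helper that only counts matches instead of materializing them.

```rust
/// Counts pairs (l, r) with `l < r` (or `l <= r` when `inclusive` is true).
/// Both inputs are sorted, then swept once with a single moving pointer,
/// so the cost is dominated by the sorts rather than by |left| * |right|.
fn count_less_than(left: &[i64], right: &[i64], inclusive: bool) -> u64 {
    let mut left = left.to_vec();
    let mut right = right.to_vec();
    left.sort_unstable();
    right.sort_unstable();

    // Number of left rows known to satisfy the predicate for the current `r`.
    let mut matched_left = 0usize;
    let mut total = 0u64;
    for &r in &right {
        // `right` is ascending, so every left row that matched the previous
        // `r` also matches this one; the pointer only ever moves forward.
        while matched_left < left.len()
            && (left[matched_left] < r || (inclusive && left[matched_left] == r))
        {
            matched_left += 1;
        }
        total += matched_left as u64;
    }
    total
}
```

For `range(30000)` joined to `range(8000)` on `<`, each right value `v` matches exactly `v` left rows, so the total is 7999 * 8000 / 2 = 31,996,000, which agrees with the count reported for that query above.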

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 9, 2025
@2010YOUY01
Contributor

I have generated some queries to benchmark, and the results look amazing 🚀 I think we can iterate from here. I'll review the implementation soon.

In-equality join benchmark

To run it, manually swap the queries in `benchmark/nlj.rs`:
const NLJ_QUERIES: &[&str] = &[
    r#"
    set datafusion.optimizer.allow_piecewise_merge_join=true
"#,
    // Q1: INNER 20 x 10K | Ultra-Low ~0.1%
    r#"
SELECT *
FROM generate_series(0, 19, 1) AS t1(v1)
JOIN generate_series(0, 9999, 1) AS t2(v2)
ON t1.v1 > t2.v2;
"#,
    // Q2: INNER 2K x 10K | Low ~10%
    r#"
SELECT *
FROM generate_series(0, 1999, 1) AS t1(v1)
JOIN generate_series(0, 9999, 1) AS t2(v2)
ON t1.v1 > t2.v2;
"#,
    // Q3: INNER 10K x 10K | Medium ~50%
    r#"
SELECT *
FROM generate_series(0, 9999, 1) AS t1(v1)
JOIN generate_series(0, 9999, 1) AS t2(v2)
ON t1.v1 > t2.v2;
"#,
    // Q4: INNER 1K x 10K | High ~95% (LHS near top)
    r#"
SELECT *
FROM generate_series(9000, 9999, 1) AS t1(v1)
JOIN generate_series(0, 9999, 1) AS t2(v2)
ON t1.v1 > t2.v2;
"#,
    // Q5: INNER 30K x 30K | Medium ~50% | comparator symmetry
    r#"
SELECT *
FROM generate_series(0, 29999, 1) AS t1(v1)
JOIN generate_series(0, 29999, 1) AS t2(v2)
ON t1.v1 < t2.v2;
"#,
    // Q6: INNER 1K x 200K | Low ~0.25% (small → large)
    r#"
SELECT *
FROM generate_series(0, 999, 1) AS t1(v1)
JOIN generate_series(0, 199999, 1) AS t2(v2)
ON t1.v1 > t2.v2;
"#,
    // Q7: INNER 200K x 10K | Med-Low ~2.5% (large → small)
    r#"
SELECT *
FROM generate_series(0, 199999, 1) AS t1(v1)
JOIN generate_series(0, 9999, 1) AS t2(v2)
ON t1.v1 < t2.v2;
"#,
    // Q8: LEFT OUTER 20K x 300 | Low ~0.75% (outer-fill path)
    r#"
SELECT *
FROM generate_series(0, 19999, 1) AS t1(v1)
LEFT JOIN generate_series(19700, 19999, 1) AS t2(v2)
ON t1.v1 > t2.v2;
"#,
    // Q9: RIGHT OUTER 100 x 100 | Zero-match (pure outer-extend check)
    r#"
SELECT *
FROM generate_series(0, 99, 1) AS t1(v1)
RIGHT JOIN generate_series(19900, 19999, 1) AS t2(v2)
ON t1.v1 > t2.v2;
"#,
];
Result:

yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (pwmj *)> ./bench.sh compare nlj pwmj
Comparing nlj and pwmj
--------------------
Benchmark nlj.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Query        ┃       nlj ┃      pwmj ┃          Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ QQuery 1     │   0.25 ms │   0.02 ms │  +13.85x faster │
│ QQuery 2     │   8.86 ms │   0.30 ms │  +29.34x faster │
│ QQuery 3     │  87.69 ms │   2.70 ms │  +32.50x faster │
│ QQuery 4     │   6.92 ms │  51.96 ms │    7.51x slower │
│ QQuery 5     │ 198.44 ms │  11.43 ms │  +17.36x faster │
│ QQuery 6     │   8.08 ms │ 136.61 ms │   16.91x slower │
│ QQuery 7     │ 473.14 ms │   2.83 ms │ +167.34x faster │
│ QQuery 8     │   5.47 ms │  49.55 ms │    9.05x slower │
│ QQuery 9     │   0.20 ms │   0.63 ms │    3.12x slower │
└──────────────┴───────────┴───────────┴─────────────────┘

Note that several queries are slower. That's because one join side is very small, so the brute-force nested loop join becomes optimal. I suspect that in some cases NLJ can even beat Hash Join.
The planner/optimizer should take those cases into account; I think this is a good follow-up project.

While trying different queries, I noticed one full join case where piecewise merge join is slower than NLJ, even though it should be faster. We should take a closer look:

set datafusion.optimizer.allow_piecewise_merge_join = true;

SELECT *
FROM range(100000) AS t1
FULL JOIN range(100000) AS t2
ON (t1.value > t2.value);

NLJ is around 1.2s on my machine, while PWMJ is around 2.5s.

@2010YOUY01
Contributor

How about first splitting the 'in-equality predicate extracting logic' into a different PR? The idea is described in #17482 (comment)
The current planning logic is not correct, and the above link points to several existing utils to handle this case easily

> set datafusion.optimizer.allow_piecewise_merge_join = true;
0 row(s) fetched.
Elapsed 0.002 seconds.

> explain SELECT *
FROM range(100000) AS t1
INNER JOIN range(1000000) AS t2
on t1.value < (t1.value+t2.value);

thread 'main' panicked at datafusion/core/src/physical_planner.rs:1310:38:
internal error: entered unreachable code
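The predicate in the failing query compares `t1.value` against an expression that references both inputs, so it cannot be treated as a "left key vs. right key" range condition. As a rough sketch of the kind of check the extraction logic needs, assuming a toy expression tree (the `Expr`, `Side`, and helper names here are hypothetical and are not DataFusion's types or existing utils):

```rust
use std::collections::HashSet;

/// Which join input a column belongs to.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Side {
    Left,
    Right,
}

/// A toy expression tree; a real planner expression type is far richer.
enum Expr {
    Column(Side),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Collects every join input referenced anywhere inside `expr`.
fn sides(expr: &Expr, out: &mut HashSet<Side>) {
    match expr {
        Expr::Column(side) => {
            out.insert(*side);
        }
        Expr::Literal(_) => {}
        Expr::Add(a, b) => {
            sides(a, out);
            sides(b, out);
        }
    }
}

/// Returns the single input an expression depends on, or `None` if it
/// references both inputs or neither.
fn single_side(expr: &Expr) -> Option<Side> {
    let mut seen = HashSet::new();
    sides(expr, &mut seen);
    if seen.len() == 1 {
        seen.into_iter().next()
    } else {
        None
    }
}

/// `lhs <op> rhs` is usable as a range join predicate only when the two
/// operands depend on opposite inputs.
fn is_range_join_predicate(lhs: &Expr, rhs: &Expr) -> bool {
    match (single_side(lhs), single_side(rhs)) {
        (Some(l), Some(r)) => l != r,
        _ => false,
    }
}
```

Under this check, `t1.value < (t1.value + t2.value)` is rejected because the right operand touches both inputs, so a planner could fall back to another join operator instead of reaching an unreachable branch. A predicate such as `t1.v1 < (t2.v1 + 1)` passes.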

Contributor

@2010YOUY01 2010YOUY01 left a comment

The implementation looks great. I've left some suggestions to polish the code; I believe they're all optional.

Contributor

@2010YOUY01 2010YOUY01 left a comment

Here are some suggestions for additional test coverage. And I think we should also enable pwmj in the extended sqlite test.
I think this PR is ready to go after passing those tests.

# specific language governing permissions and limitations
# under the License.


For additional test coverage, we can include the following cases in this test file:

  1. Nulls in the compare key column
  2. Different projections, like `select *`
  3. Expressions in the join predicate, like `on t1.v1 < (t2.v1+1)`, which should also be able to use pwmj
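For the null case in particular, a small reference model can make the expected results explicit. This is only a sketch under standard SQL comparison semantics, where a comparison involving NULL is unknown and therefore never matches; `lt_matches` and `left_join_lt` are hypothetical helpers, not part of the PR.

```rust
/// SQL-style `l < r`: the result is unknown (treated as no match)
/// whenever either key is NULL.
fn lt_matches(l: Option<i64>, r: Option<i64>) -> bool {
    match (l, r) {
        (Some(l), Some(r)) => l < r,
        _ => false,
    }
}

/// Reference LEFT JOIN on `l < r` using plain nested loops. Left rows with
/// no match, including those whose key is NULL, are padded with `None`.
fn left_join_lt(
    left: &[Option<i64>],
    right: &[Option<i64>],
) -> Vec<(Option<i64>, Option<i64>)> {
    let mut out = Vec::new();
    for &l in left {
        let mut matched = false;
        for &r in right {
            if lt_matches(l, r) {
                out.push((l, r));
                matched = true;
            }
        }
        if !matched {
            out.push((l, None));
        }
    }
    out
}
```

One pitfall worth covering in tests: Rust's derived ordering on `Option` treats `None` as smaller than any `Some`, so comparing `Option<i64>` values directly would make NULL keys match, which is not the SQL behavior.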

@jonathanc-n
Contributor Author

Note that several queries are slower. That's because one join side is very small, so the brute-force nested loop join becomes optimal. I suspect that in some cases NLJ can even beat Hash Join. The planner/optimizer should take those cases into account; I think this is a good follow-up project.

Currently it doesn't support swapping inputs. It should be faster when the right side is smaller.
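Swapping the inputs of an inequality join means flipping both the comparison operator and the outer-join direction. A minimal sketch of that bookkeeping follows; `CmpOp`, `JoinKind`, and `plan_sides` are hypothetical names for illustration, and the "smaller input goes on the right" rule is an assumption taken from the comment above rather than a measured policy.

```rust
/// Comparison operators usable as a single range predicate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CmpOp {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// The classic join types covered by this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
}

impl CmpOp {
    /// `a op b` holds exactly when `b op.swap() a` holds.
    fn swap(self) -> Self {
        match self {
            CmpOp::Lt => CmpOp::Gt,
            CmpOp::LtEq => CmpOp::GtEq,
            CmpOp::Gt => CmpOp::Lt,
            CmpOp::GtEq => CmpOp::LtEq,
        }
    }
}

impl JoinKind {
    /// Left and right outer joins trade places; inner and full are symmetric.
    fn swap(self) -> Self {
        match self {
            JoinKind::Left => JoinKind::Right,
            JoinKind::Right => JoinKind::Left,
            other => other,
        }
    }
}

/// Hypothetical rule: swap so the smaller input ends up on the right.
/// Returns the operator and join type to plan with, plus a swap flag.
fn plan_sides(
    op: CmpOp,
    kind: JoinKind,
    left_rows: usize,
    right_rows: usize,
) -> (CmpOp, JoinKind, bool) {
    if left_rows < right_rows {
        (op.swap(), kind.swap(), true)
    } else {
        (op, kind, false)
    }
}
```

A real implementation would also need a projection after the join to restore the original column order, since swapping the inputs changes the output schema.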


Labels

common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)
