Skip to content

Consider adding a JoinGraph structure for representing joins #17719

@alamb

Description

@alamb

This is part of a larger discussion for improving the APIs for Join planning in DataFusion

Specifically, many more sophisticated join ordering algorithms rely on a data structure called a "JoinGraph" (defined below)

In DataFusin terms, a JoinGraph could be formed from any tree of LogicalPlan::Join or corresponding PhysicalPlan

Some ideas:

/// represents a join graph
struct JoinGraph {
...
}

impl JoinGraph {
  // make a new join graph from a plan
  // NOTE this maybe should be in terms of physical plans / `Arc<dyn ExecutionPlan>`
  fn new_from_plan(logical_plan: LogicalPlan) -> Result<Self> {..}
...
  // create a new plan from a join grap
  fn into_plan(self) -> Result<LogicalPlan>
}

Join Graphs

Quoting a pretty good Google AI summary:

A join graph in the context of relational databases is a data structure that represents the relationships and potential join operations between tables in a relational database schema.

Here's a breakdown of its definition:

  • Vertices (Nodes): Each vertex in a join graph corresponds to a table (or relation) in the relational database schema.
  • Edges: An edge between two vertices (tables) signifies a join relationship between those tables (e.g an equality predicate).
  • Edge Labels (Join Conditions): Edges are labeled with the specific join conditions that define how the connected tables are to be joined. These conditions specify the attributes that must match for a successful join (e.g., TableA.ID = TableB.TableA_ID).

Join graphs are used to:

  • Plan Query Execution: Database query optimizers can use join graphs to identify efficient ways to execute queries involving multiple joins.

For example, TPCH Q3 is expressed in SQL like this:

select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
        c_mktsegment = 'BUILDING'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-15'
  and l_shipdate > date '1995-03-15'
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    revenue desc,
    o_orderdate
limit 10;

It can be represented with the following JoinGraph (the edges of the arrows represent foreign key --> Primary Key constraints)

graph TD
    customer["customer"]
    orders["orders"]
    lineitem["lineitem"]

    customer -->|c_custkey = o_custkey| orders
    orders -->|o_orderkey = l_orderkey| lineitem
Loading

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions