-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
This is part of a larger discussion for improving the APIs for Join planning in DataFusion
Specifically, many more sophisticated join ordering algorithms rely on a data structure called a "JoinGraph" (defined below)
In DataFusin terms, a JoinGraph could be formed from any tree of LogicalPlan::Join
or corresponding PhysicalPlan
Some ideas:
/// represents a join graph
struct JoinGraph {
...
}
impl JoinGraph {
// make a new join graph from a plan
// NOTE this maybe should be in terms of physical plans / `Arc<dyn ExecutionPlan>`
fn new_from_plan(logical_plan: LogicalPlan) -> Result<Self> {..}
...
// create a new plan from a join grap
fn into_plan(self) -> Result<LogicalPlan>
}
Join Graphs
Quoting a pretty good Google AI summary:
A join graph in the context of relational databases is a data structure that represents the relationships and potential join operations between tables in a relational database schema.
Here's a breakdown of its definition:
- Vertices (Nodes): Each vertex in a join graph corresponds to a table (or relation) in the relational database schema.
- Edges: An edge between two vertices (tables) signifies a join relationship between those tables (e.g an equality predicate).
- Edge Labels (Join Conditions): Edges are labeled with the specific join conditions that define how the connected tables are to be joined. These conditions specify the attributes that must match for a successful join (e.g., TableA.ID = TableB.TableA_ID).
Join graphs are used to:
- Plan Query Execution: Database query optimizers can use join graphs to identify efficient ways to execute queries involving multiple joins.
For example, TPCH Q3 is expressed in SQL like this:
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '1995-03-15'
and l_shipdate > date '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
limit 10;
It can be represented with the following JoinGraph (the edges of the arrows represent foreign key --> Primary Key constraints)
graph TD
customer["customer"]
orders["orders"]
lineitem["lineitem"]
customer -->|c_custkey = o_custkey| orders
orders -->|o_orderkey = l_orderkey| lineitem