| 1 | +# 🎯 optd |
| 2 | + |
1 | 3 | [](https://codecov.io/gh/cmu-db/optd) |
2 | 4 |
3 | | -# optd |
| 5 | +optd is a high-performance, extensible optimizer-as-a-service, built to support research in cardinality estimation, adaptive planning, AI-driven optimization, and parallelism. It serves as both a prototype system and a foundation for building production-ready optimizers. |
| 6 | + |
| 7 | +## ✨ Core Features |
| 8 | + |
| 9 | +**🔍 Flexible Search Strategy**: Unlike traditional recursive sub-plan optimizers, optd supports broader, non-recursive search spaces for faster and better plan discovery. |
| 10 | + |
| 11 | +**⚡ Parallelism**: |
| 12 | +- *Inter-query*: Optimize multiple queries in parallel while sharing computation |
| 13 | +- *Intra-query*: Explore a single plan's search space using many threads |
| 14 | + |
| 15 | +**💾 Persistent Memoization**: The optimizer acts like a database—plans and statistics are stored and reused, enabling adaptivity through feedback from prior executions. |
| 16 | + |
| 17 | +**📝 Rule DSL**: Define transformation rules in a high-level, expressive DSL. Our rule engine is Turing complete, enabling compact definitions of complex transformations like join order enumeration. |
| 18 | + |
| 19 | +**Example Data Type Definition**: |
| 20 | +``` |
| 21 | +data Logical = |
| 22 | + | Join(left: Logical, right: Logical, type: JoinType, predicate: Scalar) |
| 23 | + | Filter(child: Logical, predicate: Scalar) |
| 24 | + | Project(child: Logical, expressions: [Scalar]) |
| 25 | + | Sort(child: Logical, order_by: [Bool]) |
| 26 | + \ Get(table_name: String) |
| 27 | +``` |
| 28 | + |
| 29 | +**Example Transformation Rule**: |
| 30 | +``` |
| 31 | +[transformation] |
| 32 | +fn (expr: Logical*) join_commute(): Logical? = match expr |
| 33 | + | Join(left, right, Inner, predicate) -> |
| 34 | + let |
| 35 | + left_props = left.properties(), |
| 36 | + right_props = right.properties(), |
| 37 | + left_len = left_props#schema#columns.len(), |
| 38 | + right_len = right_props#schema#columns.len(), |
| 39 | + |
| 40 | + right_indices = 0..right_len, |
| 41 | + left_indices = 0..left_len, |
| 42 | + |
| 43 | + remapping = (left_indices.map((i: I64) -> (i, i + right_len)) ++ |
| 44 | + right_indices.map((i: I64) -> (i + left_len, i))).to_map(), |
| 45 | + in |
| 46 | + Project( |
| 47 | + Join(right, left, Inner, predicate.remap(remapping)), |
| 48 | + left_indices.map((i: I64) -> ColumnRef(i + right_len)) ++ right_indices.map((i: I64) -> ColumnRef(i)) |
| 49 | + ) |
| 50 | + \ _ -> none |
| 51 | +``` |
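To make the index arithmetic in `join_commute` concrete, here is a minimal Rust sketch. It is not optd code: `commute_remapping` and `restoring_projection` are hypothetical helper names, and the sketch assumes that the commuted join outputs the right input's columns first and that the surrounding projection is meant to restore the original left-then-right column order.

```rust
use std::collections::HashMap;

/// Maps each column index of `Join(left, right)` to its index in
/// `Join(right, left)`. Hypothetical helper, for illustration only.
fn commute_remapping(left_len: usize, right_len: usize) -> HashMap<usize, usize> {
    // Left columns end up after the right columns.
    let left = (0..left_len).map(|i| (i, i + right_len));
    // Right columns move to the front.
    let right = (0..right_len).map(|i| (i + left_len, i));
    left.chain(right).collect()
}

/// Column references into the commuted join's output that restore the
/// original left-then-right order (assumed intent of the projection).
fn restoring_projection(left_len: usize, right_len: usize) -> Vec<usize> {
    (0..left_len)
        .map(|i| i + right_len)
        .chain(0..right_len)
        .collect()
}
```

With `left_len = 2` and `right_len = 3`, the remapping sends old columns 0 and 1 to 3 and 4, and old columns 2, 3, 4 to 0, 1, 2; the restoring projection reads columns 3, 4, 0, 1, 2 in that order.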
| 52 | + |
| 53 | +**🔧 Pluggable Scheduling**: Apply rules using customizable scheduling strategies—from heuristics to AI-guided decisions. |
| 54 | + |
| 55 | +**🔍 Explainability**: Track rule application history for better debugging and plan introspection. |
| 56 | + |
| 57 | +**🔌 Extensibility**: Define custom operators and inherit existing rules. Designed to integrate with standards like Substrait, with a smoother UX than systems like Calcite. |
| 58 | + |
| 59 | +## 🛠️ Usage |
| 60 | + |
| 61 | +optd is currently under development. The costing mechanism is still being implemented, but there is a small demo available. The DSL tooling is more mature. |
| 62 | + |
| 63 | +### Running the Demo |
| 64 | + |
| 65 | +```bash |
| 66 | +# Run the demo test (located in optd/src/demo/mod.rs) |
| 67 | +cargo test test_optimizer_demo -- --nocapture |
| 68 | +``` |
| 69 | + |
| 70 | +### CLI Tool |
| 71 | + |
| 72 | +```bash |
| 73 | +# Compile a DSL file |
| 74 | +cargo run --bin optd-cli -- compile path/to/file.opt |
| 75 | + |
| 76 | +# Compile with verbose output and show intermediate representations |
| 77 | +cargo run --bin optd-cli -- compile path/to/file.opt --verbose --show-ast --show-hir |
| 78 | + |
| 79 | +# Compile with mock UDFs for testing |
| 80 | +cargo run --bin optd-cli -- compile path/to/file.opt --mock-udfs map get_table_schema properties statistics optimize |
| 81 | + |
| 82 | +# Run functions marked with [run] annotation |
| 83 | +cargo run --bin optd-cli -- run-functions path/to/file.opt |
| 84 | +``` |
| 85 | + |
| 86 | +## 🧮 TODO: How to Perform Costing |
| 87 | + |
| 88 | +Physical expressions need to be costed. Their children are either goals or other physical expressions (called goal members). Let's take the following example: `EXPR(goal_1, sub_expr_2)`. To cost that expression, we have multiple approaches: |
| 89 | + |
| 90 | +### Approach 1: Recursive Optimal Costing |
| 91 | +Recursively compute the optimal cost of `goal_1` and `sub_expr_2`. This approach is challenging because: |
| 92 | +- It requires invalidation whenever we get a better expression for `goal_1` or `sub_expr_2` |
| 93 | +- It doesn't ensure a global minimum, as greedy approaches are not always optimal |
| 94 | +- We cannot support physical→physical optimizations (if that turns out to be useful) |
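A rough Rust sketch of what recursive optimal costing could look like is shown here. All types (`Child`, `PhysicalExpr`, `Memo`) are hypothetical stand-ins rather than optd's actual data structures, the cost model (local cost plus the sum of child costs) is an assumption, and goals are assumed to be acyclic. Nothing is cached in this sketch; a real implementation would cache goal costs, which is exactly where the invalidation problem listed above appears.

```rust
use std::collections::HashMap;

/// Hypothetical child of a physical expression: a goal or a nested expression.
enum Child {
    Goal(usize),
    Expr(Box<PhysicalExpr>),
}

/// Hypothetical physical expression with a local operator cost.
struct PhysicalExpr {
    local_cost: f64,
    children: Vec<Child>,
}

/// Hypothetical memo: each goal id maps to its member expressions.
struct Memo {
    goals: HashMap<usize, Vec<PhysicalExpr>>,
}

impl Memo {
    /// Cheapest known cost of a goal: the minimum over its members.
    /// Returns infinity for an unknown or empty goal.
    fn goal_cost(&self, goal: usize) -> f64 {
        self.goals
            .get(&goal)
            .into_iter()
            .flatten()
            .map(|e| self.expr_cost(e))
            .fold(f64::INFINITY, f64::min)
    }

    /// Local cost plus the recursively computed optimal cost of every child.
    fn expr_cost(&self, expr: &PhysicalExpr) -> f64 {
        expr.children.iter().fold(expr.local_cost, |acc, child| {
            acc + match child {
                Child::Goal(g) => self.goal_cost(*g),
                Child::Expr(e) => self.expr_cost(e),
            }
        })
    }
}
```

Costing `EXPR(goal_1, sub_expr_2)` this way always picks the cheapest member of `goal_1` in isolation, which is the greedy choice the second bullet warns about.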
| 95 | + |
| 96 | +### Approach 2: Explore All Possibilities |
| 97 | +Explore all possibilities and rely on the scheduler to avoid combinatorial explosion. This is more in line with what we do for transformations and implementations. We can define a costing function in the DSL with the following signature: |
| 98 | + |
| 99 | +``` |
| 100 | +fn (plan: Physical*) cost(): (f64, Statistics) |
| 101 | +``` |
| 102 | + |
| 103 | +`Physical*` indicates that the expression is stored, so it has extra guarantees (e.g., all children are ingested). This mirrors what we use for logical implementations and transformations. |
| 104 | + |
| 105 | +`f64` is the cost, and `Statistics` is any user-defined data type (could be ML weights, histograms, etc.). |
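As an illustration of why this approach leans on the scheduler, the following Rust sketch enumerates every combination of child alternatives and produces one `(f64, Statistics)` pair per combination. The `Statistics` struct, the `cost_all` function, and the toy model (cost is the local cost plus the sum of child costs, row count is the product of child row counts) are all assumptions made for the example, not part of optd.

```rust
/// Hypothetical user-defined statistics; here just an estimated row count.
#[derive(Clone, Debug, PartialEq)]
struct Statistics {
    rows: f64,
}

/// One costed alternative, mirroring the `(f64, Statistics)` return type.
type Costed = (f64, Statistics);

/// Costs the parent once per combination of child alternatives.
/// `children[i]` lists the costed alternatives of the i-th child.
fn cost_all(local_cost: f64, children: &[Vec<Costed>]) -> Vec<Costed> {
    // Start from a single empty combination: (summed child cost, row product).
    let mut combos: Vec<(f64, f64)> = vec![(0.0, 1.0)];
    for alternatives in children {
        let mut next = Vec::with_capacity(combos.len() * alternatives.len());
        for (cost, rows) in &combos {
            for (alt_cost, alt_stats) in alternatives {
                next.push((cost + alt_cost, rows * alt_stats.rows));
            }
        }
        combos = next;
    }
    combos
        .into_iter()
        .map(|(cost, rows)| (local_cost + cost, Statistics { rows }))
        .collect()
}
```

The number of results is the product of the alternative counts of all children, which is the combinatorial growth the scheduler has to keep in check.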
| 106 | + |
| 107 | +When we encounter a goal, we expand it and materialize all physical expressions in that goal (and its subgoals!). We need new syntax to expand/cost a nested physical expression. **Idea**: a `$` postfix, meaning "into costed". The left-hand type should be `Physical*`, which the type checker can easily verify. |
| 108 | + |
| 109 | +### Approach 3: Final Approach (Best of All Worlds) |
| 110 | + |
| 111 | +``` |
| 112 | +// This is a UDF/external function, similar to optimize for implementations |
| 113 | +fn (plan: Physical) into_costed(cost: f64, stats: Statistics) |
| 114 | +``` |
| 115 | + |
| 116 | +``` |
| 117 | +fn (plan: Physical*) cost(): Physical$ |
| 118 | +``` |
| 119 | + |
| 120 | +In the memo, each physical expression id will have a set of costed expressions: |
| 121 | +``` |
| 122 | +pid -> {pid + cost + stats} |
| 123 | +``` |
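The `pid -> {pid + cost + stats}` mapping could be modeled roughly as follows. This is a sketch in Rust under stated assumptions: `CostedMemo`, `CostedExpr`, and `Statistics` are hypothetical names, physical expression ids are assumed to be plain integers, and returning `true` from `add` for a previously unseen entry is one possible way to signal the scheduler that something changed.

```rust
use std::collections::HashMap;

/// Hypothetical statistics payload attached to a costed expression.
#[derive(Clone, Debug, PartialEq)]
struct Statistics {
    rows: f64,
}

/// A physical expression id paired with one cost and its statistics.
#[derive(Clone, Debug, PartialEq)]
struct CostedExpr {
    pid: u64,
    cost: f64,
    stats: Statistics,
}

/// Hypothetical memo table: pid -> {pid + cost + stats}.
#[derive(Default)]
struct CostedMemo {
    costed: HashMap<u64, Vec<CostedExpr>>,
}

impl CostedMemo {
    /// Records a costed version; returns true if it was not already known.
    fn add(&mut self, pid: u64, cost: f64, stats: Statistics) -> bool {
        let entry = CostedExpr { pid, cost, stats };
        let set = self.costed.entry(pid).or_default();
        if set.contains(&entry) {
            return false;
        }
        set.push(entry);
        true
    }

    /// Cheapest costed version currently known for `pid`, if any.
    fn best(&self, pid: u64) -> Option<&CostedExpr> {
        self.costed
            .get(&pid)?
            .iter()
            .min_by(|a, b| a.cost.total_cmp(&b.cost))
    }
}
```

Keeping every costed version, rather than only the cheapest one, is what leaves room for later physical→physical transformations to pick a different entry.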
| 124 | + |
| 125 | +This approach is excellent because: |
| 126 | +1. It uses the same updating mechanisms as for implementations and explorations (consistent scheduler!) |
| 127 | +2. It allows for further physical→physical transformations |
| 128 | +3. You can do whatever you want when costing an expression: go as deep as needed, and choose whether or not to cost recursively |
| 129 | +4. It can propagate statistics perfectly |
| 130 | + |
| 131 | +**Only caveat**: There is no built-in mechanism for cost pruning, but you can instrument the scheduler to prune. |
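One way scheduler-side pruning might be instrumented is a simple upper-bound check, sketched below in Rust. The `Pruner` type and its methods are hypothetical and do not correspond to any existing optd scheduler API; the sketch also assumes the scheduler can obtain a lower bound on the cost of whatever a costing task could produce.

```rust
/// Hypothetical scheduler-side pruner: tracks the best complete-plan cost seen.
struct Pruner {
    best_so_far: f64,
}

impl Pruner {
    /// Starts with no known plan, so nothing is pruned yet.
    fn new() -> Self {
        Pruner { best_so_far: f64::INFINITY }
    }

    /// Decides whether a costing task is worth scheduling, given a lower
    /// bound on the cost of any plan it could produce.
    fn should_schedule(&self, lower_bound: f64) -> bool {
        lower_bound < self.best_so_far
    }

    /// Tightens the bound when a complete plan with a known cost is found.
    /// A more expensive plan leaves the bound unchanged.
    fn observe(&mut self, plan_cost: f64) {
        self.best_so_far = self.best_so_far.min(plan_cost);
    }
}
```

A scheduler would call `observe` whenever a full plan is costed and consult `should_schedule` before dispatching further costing work.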
| 132 | + |
| 133 | +--- |
4 | 134 |
5 | | -Query Optimizer Service |
| 135 | +**📧 Contact**: Please reach out to [email protected] for more information about optd. |