minsql uses a volcano-style execution model with iterator-based operators. Each operator implements a standard interface and produces tuples on demand.
```
User Query
    ↓
Lexer → Tokens
    ↓
Parser → AST
    ↓
Semantic Analysis → Intent
    ↓
Logical Planner → Logical Plan
    ↓
Optimizer → Optimized Logical Plan
    ↓
Physical Planner → Physical Plan
    ↓
Execution Engine → Results
```
The semantic analyzer converts the AST into a structured intent representation:
```rust
Intent::Retrieve {
    projection: vec![Column("name"), Column("email")],
    source: Table("users"),
    filter: Some(BinaryOp {
        op: GreaterThan,
        left: Column("age"),
        right: Literal(18)
    }),
    limit: Some(10)
}
```

This intent is what the optimizer operates on.
The logical planner creates an operator tree:
```
Limit(10)
    ↓
Project(name, email)
    ↓
Filter(age > 18)
    ↓
Scan(users)
```
The physical planner chooses implementations:
```
Limit(10)
    ↓
Project(name, email)
    ↓
Filter(age > 18)
    ↓
IndexScan(users, users_age_idx)
```
Here, the scan was converted to an index scan because an appropriate index exists.
Every operator implements:
```rust
trait Operator {
    fn open(&mut self) -> Result<()>;
    fn next(&mut self) -> Result<Option<Tuple>>;
    fn close(&mut self) -> Result<()>;
}
```

- `open()`: Initialize operator state
- `next()`: Produce the next tuple, or `None` if exhausted
- `close()`: Release resources
Operators are composed in a tree. Parent operators call next() on children to pull tuples.
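The pull-based composition above can be sketched in a few lines. This is a minimal illustration, not minsql's actual code: it assumes in-memory integer tuples and drops the `Result` wrapper for brevity; `VecScan` and `run` are hypothetical names.

```rust
type Tuple = Vec<i64>;

trait Operator {
    fn open(&mut self);
    fn next(&mut self) -> Option<Tuple>;
    fn close(&mut self);
}

/// Scans an in-memory vector of tuples (stands in for a table scan).
struct VecScan {
    rows: Vec<Tuple>,
    pos: usize,
}

impl Operator for VecScan {
    fn open(&mut self) { self.pos = 0; }
    fn next(&mut self) -> Option<Tuple> {
        let t = self.rows.get(self.pos).cloned();
        self.pos += 1;
        t
    }
    fn close(&mut self) {}
}

/// Passes through only tuples whose first column exceeds a threshold.
struct Filter {
    input: Box<dyn Operator>,
    min: i64,
}

impl Operator for Filter {
    fn open(&mut self) { self.input.open(); }
    fn next(&mut self) -> Option<Tuple> {
        // Pull from the child until a tuple satisfies the predicate.
        while let Some(t) = self.input.next() {
            if t[0] > self.min {
                return Some(t);
            }
        }
        None
    }
    fn close(&mut self) { self.input.close(); }
}

/// Drives the root operator: open, drain, close.
fn run(mut root: Box<dyn Operator>) -> Vec<Tuple> {
    root.open();
    let mut out = Vec::new();
    while let Some(t) = root.next() {
        out.push(t);
    }
    root.close();
    out
}

fn main() {
    // Filter(col0 > 18) pulling from a three-row scan.
    let scan = VecScan { rows: vec![vec![30], vec![10], vec![25]], pos: 0 };
    let filter = Filter { input: Box::new(scan), min: 18 };
    assert_eq!(run(Box::new(filter)), vec![vec![30], vec![25]]);
}
```

Note how the parent never sees the filtered-out row: each `next()` call recursively pulls from the child until it can satisfy the request.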
Sequential table scan:
```rust
PhysicalPlan::SeqScan {
    table: "users",
    columns: vec!["id", "name", "age"]
}
```

Implementation: Iterates through table pages, returning tuples that match the column projection.
Filters tuples based on a predicate:
```rust
PhysicalPlan::Filter {
    predicate: FilterIntent::Comparison {
        op: GreaterThan,
        left: Column("age"),
        right: Constant(18)
    },
    input: SeqScan(...)
}
```

Projects specific columns:
```rust
PhysicalPlan::Project {
    columns: vec![Named("name"), Named("email")],
    input: Filter(...)
}
```

Fully Implemented: Writes data to storage with durability guarantees:
```rust
PhysicalPlan::Insert {
    table: "users",
    columns: vec!["name", "age"],
    values: vec![
        vec![String("Alice"), Integer(30)],
        vec![String("Bob"), Integer(25)]
    ]
}
```

Storage Operations:
- Converts ConstantValue to Tuple format
- Serializes tuple to JSON/bytes
- Calls `storage.insert_row(table, bytes)`, which writes to storage pages
- Returns a unique row ID for each inserted row
- Flushes WAL for durability
- Updates indexes and statistics
Guarantees:
- ACID compliant with WAL logging
- Atomic batch inserts
- Immediate durability via WAL flush
- Row ID tracking for references
Fully Implemented: Modifies existing rows in storage:
```rust
PhysicalPlan::Update {
    table: "users",
    assignments: vec![
        Assignment { column: "age", value: Integer(31) }
    ],
    filter: Some(Comparison { ... })
}
```

Storage Operations:
- Builds filter predicate for row matching
- Serializes assignments to bytes
- Calls `storage.update_rows(table, filter, assignments)`
- Storage engine scans and updates matching rows
- Returns count of updated rows
- Flushes WAL for durability
- Updates indexes automatically
Guarantees:
- Transactional updates
- Filter-based row matching
- Index consistency maintained
- WAL-based crash recovery
Fully Implemented: Removes rows from storage:
```rust
PhysicalPlan::Delete {
    table: "users",
    filter: Some(LessThan { ... })
}
```

Storage Operations:
- Builds filter predicate
- Calls `storage.delete_rows(table, filter)`
- Storage marks/removes matching rows
- Updates free space map
- Returns count of deleted rows
- Flushes WAL for durability
- Updates indexes and statistics
Guarantees:
- Transactional deletes
- Space reclamation
- Index cleanup
- Crash-safe via WAL
Fully Implemented: Creates tables with persistent schema:
```rust
PhysicalPlan::CreateTable {
    name: "products",
    columns: vec![
        ColumnDefinition {
            name: "id",
            data_type: Integer,
            nullable: false,
            primary_key: true
        },
        ColumnDefinition {
            name: "name",
            data_type: Text,
            nullable: false,
            primary_key: false
        }
    ]
}
```

Storage Operations:
- Builds schema metadata (JSON format)
- Stores schema in system catalog via `storage.create_table()`
- Allocates initial storage pages
- Creates primary key index if specified
- Flushes WAL + checkpoint for durability
- Schema persisted for crash recovery
Guarantees:
- Schema stored in system catalog
- Transactional DDL operations
- Checkpoint ensures durability
- Ready for immediate INSERT operations
Reads tuples from storage:
- SeqScan: Full table scan
- IndexScan: Index-based access
- BitmapScan: Bitmap index scan for multiple conditions
Evaluates a predicate on each tuple:
Filter(age > 18 AND active = true)
Only tuples satisfying the predicate are passed to the parent.
Extracts specific columns:
Project(name, email)
Produces tuples containing only the projected columns.
Combines tuples from two inputs:
- NestedLoopJoin: Simple nested loop
- HashJoin: Hash-based join for equality predicates
- MergeJoin: Sort-merge join for sorted inputs
Computes aggregates:
```
Aggregate(
    group_by: [department],
    aggregates: [Count(*), Avg(salary)]
)
```
Accumulates state and produces one tuple per group.
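The accumulate-then-finalize pattern for a hash aggregate can be sketched as follows. This is an illustrative assumption about the mechanics, not minsql's implementation; the `aggregate` function and its `(count, sum)` accumulator are hypothetical, mirroring the `Count(*)`/`Avg(salary)` example above.

```rust
use std::collections::HashMap;

/// Computes (Count(*), Avg(salary)) per department over input tuples.
fn aggregate(rows: &[(&str, f64)]) -> HashMap<String, (u64, f64)> {
    // Accumulation phase: keep a running (count, sum) per group key.
    let mut acc: HashMap<String, (u64, f64)> = HashMap::new();
    for (dept, salary) in rows {
        let e = acc.entry(dept.to_string()).or_insert((0, 0.0));
        e.0 += 1;
        e.1 += salary;
    }
    // Finalize phase: emit one tuple per group, turning sum into avg.
    acc.into_iter()
        .map(|(k, (n, sum))| (k, (n, sum / n as f64)))
        .collect()
}

fn main() {
    let rows = [("eng", 100.0), ("eng", 200.0), ("sales", 50.0)];
    let out = aggregate(&rows);
    assert_eq!(out["eng"], (2, 150.0));
    assert_eq!(out["sales"], (1, 50.0));
}
```

Avg is computed from a running sum rather than stored directly because sums compose; this is also what makes the parallel partial-aggregation strategy described later possible.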
Orders tuples:
Sort(order_by: [created_at DESC])
Buffers all input tuples, sorts them, then produces sorted output.
Restricts output:
Limit(10)
Stops after producing N tuples.
Writes tuples to storage:
Insert(table: users, values: [...])
Modifies existing tuples:
```
Update(
    table: users,
    set: [(age, age + 1)],
    filter: Some(active = true)
)
```
Removes tuples:
Delete(table: users, filter: Some(age < 18))
Expressions are evaluated recursively:
```
BinaryOp(
    op: GreaterThan,
    left: Column("age"),
    right: Literal(18)
)
```
Evaluation:
- Evaluate left operand → retrieve "age" column value
- Evaluate right operand → constant 18
- Apply operator → compare values
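The three evaluation steps above map directly onto a recursive match over the expression tree. A minimal sketch, assuming integer-only values; the `Expr` enum and `eval` function are hypothetical names mirroring the document's `BinaryOp`/`Column`/`Literal` shapes.

```rust
#[derive(Clone)]
enum Expr {
    Column(&'static str),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

/// Recursively evaluates an expression against one tuple,
/// given as (column name, value) pairs. Booleans are 1/0.
fn eval(expr: &Expr, row: &[(&str, i64)]) -> i64 {
    match expr {
        // Retrieve the named column's value from the current tuple.
        Expr::Column(name) => row.iter().find(|(n, _)| n == name).unwrap().1,
        // A literal evaluates to itself.
        Expr::Literal(v) => *v,
        // Evaluate both operands, then apply the comparison operator.
        Expr::Gt(l, r) => (eval(l, row) > eval(r, row)) as i64,
    }
}

fn main() {
    // age > 18
    let e = Expr::Gt(Box::new(Expr::Column("age")), Box::new(Expr::Literal(18)));
    assert_eq!(eval(&e, &[("age", 30)]), 1);
    assert_eq!(eval(&e, &[("age", 10)]), 0);
}
```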
Expressions are strongly typed. Type checking happens during semantic analysis:
```
age > 18     // Valid: integer > integer
age > "foo"  // Invalid: integer > text
age + "5"    // Invalid: integer + text
```
SQL-style three-valued logic:
```
null = null   → null
null > 18     → null
null AND true → null
null OR true  → true
```
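Three-valued logic can be modeled with `Option<bool>`, where `None` stands for null. This modeling choice is an assumption for illustration, not minsql's actual value type; `and3`/`or3` are hypothetical names.

```rust
/// SQL three-valued AND: false dominates, null is contagious otherwise.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // false AND anything is false, even if the other side is null.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// SQL three-valued OR: true dominates, null is contagious otherwise.
fn or3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // true OR anything is true, even if the other side is null.
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

fn main() {
    assert_eq!(and3(None, Some(true)), None);         // null AND true → null
    assert_eq!(or3(None, Some(true)), Some(true));    // null OR true → true
    assert_eq!(and3(None, Some(false)), Some(false)); // null AND false → false
}
```

The asymmetry in the table above falls out of the dominance rules: `true` short-circuits OR regardless of null, but `true` cannot short-circuit AND, so `null AND true` stays null.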
Each query runs with tracked resources:
```rust
struct QueryLimits {
    max_cpu_time: Duration,
    max_memory: usize,
    max_wall_time: Duration,
}
```

- CPU time tracked via execution instrumentation
- Memory tracked via allocator hooks
- Wall time tracked via deadline checks
If limits are exceeded, execution is aborted:
```
Error: Query exceeded memory limit (100MB)
```
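The wall-time deadline check can be sketched as a comparison performed between tuples. This is a hypothetical illustration of the mechanism, not minsql's code; `check_deadline` is an assumed helper name.

```rust
use std::time::{Duration, Instant};

struct QueryLimits {
    max_wall_time: Duration,
}

/// Returns an error once elapsed wall time exceeds the limit;
/// the engine would call this periodically and abort on Err.
fn check_deadline(started: Instant, limits: &QueryLimits) -> Result<(), String> {
    if started.elapsed() > limits.max_wall_time {
        Err("Query exceeded wall time limit".to_string())
    } else {
        Ok(())
    }
}

fn main() {
    let limits = QueryLimits { max_wall_time: Duration::from_millis(1) };
    let started = Instant::now();
    assert!(check_deadline(started, &limits).is_ok());
    std::thread::sleep(Duration::from_millis(5));
    assert!(check_deadline(started, &limits).is_err());
}
```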
Queries are assigned priorities:
- High: Interactive queries
- Normal: Standard queries
- Low: Background jobs
Lower priority queries yield CPU to higher priority queries.
When enabled:
- System clock access is forbidden
- Random number generation is seeded
- Operator scheduling is deterministic
- Memory allocation is deterministic
Time is tracked via Hybrid Logical Clock (HLC):
```rust
struct LogicalTime {
    logical: u64,
    physical: u64,
}
```

In deterministic mode, `physical` is frozen and only `logical` advances.
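A sketch of that update rule, contrasting deterministic mode with the usual HLC behavior. The `tick` method and its signature are assumptions for illustration; only the `LogicalTime` struct comes from the document.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
struct LogicalTime {
    logical: u64,
    physical: u64,
}

impl LogicalTime {
    /// Advances the clock. `wall_clock` is the current physical reading.
    fn tick(&mut self, deterministic: bool, wall_clock: u64) {
        if deterministic {
            // Deterministic mode: physical is frozen, wall clock ignored.
            self.logical += 1;
        } else if wall_clock > self.physical {
            // Normal HLC rule: adopt the newer wall clock, reset logical.
            self.physical = wall_clock;
            self.logical = 0;
        } else {
            // Wall clock hasn't moved past us: advance logical instead.
            self.logical += 1;
        }
    }
}

fn main() {
    let mut t = LogicalTime { logical: 0, physical: 100 };
    t.tick(true, 999); // deterministic: physical stays at 100
    t.tick(true, 999);
    assert_eq!(t, LogicalTime { logical: 2, physical: 100 });
    t.tick(false, 200); // normal mode: physical catches up
    assert_eq!(t, LogicalTime { logical: 0, physical: 200 });
}
```

Freezing `physical` is what makes reruns byte-for-byte reproducible: two executions of the same workload produce identical timestamps regardless of real elapsed time.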
- Reproducible debugging
- Consistent replication
- Predictable testing
- Audit compliance
Operators respect transaction snapshots:
```rust
struct Snapshot {
    xid: TransactionId,
    created_at: LogicalTime,
    active_xids: Vec<TransactionId>,
}
```

Tuples are visible if:
- Created by committed transaction
- Created before snapshot time
- Not created by active transaction
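The three visibility rules translate directly into a predicate. A sketch under stated assumptions: the `TupleHeader` type with its `committed` flag is hypothetical (a real engine would consult transaction status elsewhere), and the numeric type aliases stand in for the document's real types.

```rust
type TransactionId = u64;
type LogicalTime = u64;

struct Snapshot {
    created_at: LogicalTime,
    active_xids: Vec<TransactionId>,
}

struct TupleHeader {
    created_by: TransactionId,
    created_at: LogicalTime,
    committed: bool, // assumed commit flag, for the sketch only
}

/// A tuple is visible iff all three rules from the text hold.
fn is_visible(t: &TupleHeader, snap: &Snapshot) -> bool {
    t.committed                                      // created by committed txn
        && t.created_at < snap.created_at            // created before snapshot
        && !snap.active_xids.contains(&t.created_by) // not from an active txn
}

fn main() {
    let snap = Snapshot { created_at: 100, active_xids: vec![7] };
    let visible = TupleHeader { created_by: 3, created_at: 50, committed: true };
    let in_flight = TupleHeader { created_by: 7, created_at: 50, committed: true };
    let too_new = TupleHeader { created_by: 3, created_at: 150, committed: true };
    assert!(is_visible(&visible, &snap));
    assert!(!is_visible(&in_flight, &snap));
    assert!(!is_visible(&too_new, &snap));
}
```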
Operators can execute against historical snapshots:
```
retrieve users
where age > 18
at timestamp '2024-11-10 12:03:21'
```
The scan operator uses the specified snapshot instead of the current one.
- SeqScan: O(n), where n is table size
- IndexScan: O(log n + k), where k is result size
- BitmapScan: O(m log n + k), where m is the number of conditions
- NestedLoopJoin: O(n × m)
- HashJoin: O(n + m) average, O(n × m) worst case
- MergeJoin: O(n + m) for sorted inputs
- Aggregate: O(n) with hash table
- SortAggregate: O(n log n) with sorting
- Sort: O(n log n)
- ExecutionError: Runtime execution failure
- ResourceExceeded: Query limit violated
- DataCorruption: Storage integrity issue
- Deadlock: Transaction deadlock detected
On error:
- Abort execution
- Release resources
- Rollback transaction if active
- Return error to client
Each operator exposes metrics:
```rust
struct OperatorMetrics {
    tuples_produced: u64,
    cpu_time: Duration,
    wall_time: Duration,
    memory_used: usize,
}
```

Query plans can be explained:
explain retrieve users where age > 18
Output:
```
Limit(10) [cost: 5.2, rows: 10]
  Filter(age > 18) [cost: 104.5, rows: 500]
    SeqScan(users) [cost: 100.0, rows: 1000]
```
Execution is traced for debugging:
```
[TRACE] Scan::open(table=users)
[TRACE] Filter::open()
[TRACE] Limit::open()
[TRACE] Scan::next() → Some(Tuple(id=1, name="Alice", age=30))
[TRACE] Filter::next() → Some(Tuple(id=1, name="Alice", age=30))
[TRACE] Limit::next() → Some(Tuple(id=1, name="Alice", age=30))
```
Large tables can be scanned in parallel:
ParallelSeqScan(workers: 4)
Each worker scans a partition of the table.
Aggregates can be computed in parallel:
```
FinalizeAggregate
    ↓
ParallelAggregate(workers: 4)
```
Workers compute partial aggregates, then a final operator combines them.
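The partial-then-finalize split can be sketched with threads standing in for workers. This is an illustrative assumption about the pattern, not minsql's implementation; `parallel_sum` and the chunk-based partitioning are hypothetical.

```rust
use std::thread;

/// Sums `data` using `workers` threads: each computes a partial
/// aggregate over its partition, then the results are combined.
fn parallel_sum(data: Vec<i64>, workers: usize) -> i64 {
    // Partition the input so each worker gets a contiguous chunk.
    let chunk = (data.len() + workers - 1) / workers;
    let handles: Vec<_> = data
        .chunks(chunk)
        .map(|part| {
            let part = part.to_vec();
            // ParallelAggregate: each worker produces a partial sum.
            thread::spawn(move || part.iter().sum::<i64>())
        })
        .collect();
    // FinalizeAggregate: combine the partial results.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let data: Vec<i64> = (1..=100).collect();
    assert_eq!(parallel_sum(data, 4), 5050);
}
```

This works because sum (like count, min, max, and sum/count for avg) is decomposable: partial results over disjoint partitions combine into the exact global result.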
Parallel operators coordinate via channels:
```rust
struct ParallelContext {
    workers: Vec<Worker>,
    coordinator: Coordinator,
}
```

Batch tuple processing for better CPU utilization.
JIT-compile hot operators for reduced interpretation overhead.
Re-optimize plans based on runtime statistics.
Push filters closer to scans across operator boundaries.