A from-scratch SQL query engine and mini relational database written in
pure Python (standard library only). queryd lexes and parses real SQL,
builds a logical query plan, optimizes it, and executes it with a
Volcano-style iterator pipeline over CSV and in-memory tables. It ships a
CLI REPL and — most importantly — a differential test harness that grades
every query against sqlite3, the most battle-tested SQL engine in the
standard library.
No external dependencies. No pip install. Runs on any CPython 3.10+.
SQL text ─▶ Lexer ─▶ Parser ─▶ AST ─▶ Logical Plan ─▶ Optimizer ─▶ Volcano Executor ─▶ rows
│
└── predicate pushdown (rule-based)
- It actually runs SQL. Not a toy that handles one query shape — it parses and executes a real subset of SQL and returns correct results.
- Provable correctness.
tests/test_oracle.pyloads the same CSV data into both queryd andsqlite3, runs the same SQL through both, and asserts the result sets are identical — across hand-written showcase queries and a 200-case randomized fuzz battery. SQLite is the oracle. - Textbook architecture. Lexer → recursive-descent + Pratt parser → AST → logical plan → physical Volcano operators. The code reads like a teaching reference.
- Zero dependencies, zero install. The engine, demo, CLI, and the test suite all run on a clean Python with nothing pip-installed.
git clone https://github.com/Vansh4195/queryd
cd queryd
# 1) Run the demo (10 showcase queries over real CSV data)
python demo/demo.py
# 2) Run the test suite — including the sqlite3 oracle + fuzzer.
# Zero-dependency runner (no pytest needed):
python run_tests.py
# ...or, if you have pytest:
python -m pytest
# 3) Open the SQL REPL over the demo CSVs
python -m queryd demo/data/employees.csv demo/data/departments.csv demo/data/sales.csvfrom queryd import Database
db = Database()
db.register_csv("employees", "demo/data/employees.csv")
db.register_csv("departments", "demo/data/departments.csv")
result = db.sql("""
SELECT d.name AS dept, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
FROM employees e JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000
GROUP BY d.name
HAVING COUNT(*) >= 2
ORDER BY avg_salary DESC
LIMIT 5
""")
print(result.pretty())
for row in result.as_dicts():
print(row)Ad-hoc, with rows passed inline as dicts:
from queryd import run_sql
r = run_sql("SELECT a FROM t WHERE a > 1 ORDER BY a",
t=[{"a": 1}, {"a": 2}, {"a": 3}])
print(r.rows) # [(2,), (3,)]$ python -m queryd demo/data/employees.csv demo/data/departments.csv
queryd 0.1.0 — type .help for commands, .quit to exit
loaded tables: departments, employees
queryd> SELECT city, COUNT(*) AS n FROM employees GROUP BY city ORDER BY n DESC;
+----------+---+
| city | n |
+----------+---+
| Austin | 9 |
| New York | 7 |
| Chicago | 4 |
+----------+---+
(3 rows)
Meta-commands:
| command | description |
|---|---|
.tables |
list registered tables |
.schema <table> |
show a table's columns + inferred types |
.explain <sql> |
print the logical plan tree for a query |
.read <file.sql> |
run all statements in a .sql file |
.help |
show help |
.quit / .exit |
leave the REPL |
One-shot execution: python -m queryd demo/data/*.csv -c "SELECT ...".
select ::= "SELECT" proj_list "FROM" table_ref join* where?
group_by? having? order_by? limit?
proj_list ::= proj ("," proj)*
proj ::= "*" | ident "." "*" | expr ("AS"? ident)?
table_ref ::= ident ("AS"? ident)?
join ::= "INNER"? "JOIN" table_ref "ON" expr
where ::= "WHERE" expr
group_by ::= "GROUP" "BY" expr ("," expr)*
having ::= "HAVING" expr
order_by ::= "ORDER" "BY" order_key ("," order_key)*
order_key ::= expr ("ASC" | "DESC")?
limit ::= "LIMIT" integer
expr ::= or_expr (* Pratt / precedence climbing *)
or_expr ::= and_expr ("OR" and_expr)*
and_expr ::= not_expr ("AND" not_expr)*
not_expr ::= "NOT" not_expr | cmp_expr
cmp_expr ::= add_expr (("=" | "<>" | "<" | "<=" | ">" | ">=") add_expr)*
add_expr ::= mul_expr (("+" | "-") mul_expr)*
mul_expr ::= unary (("*" | "/") unary)*
unary ::= "-" unary | primary
primary ::= number | string | "NULL" | "TRUE" | "FALSE"
| column_ref | func_call | "(" expr ")"
column_ref ::= ident ("." ident)?
func_call ::= agg "(" (expr | "*") ")"
agg ::= "COUNT" | "SUM" | "AVG" | "MIN" | "MAX"
- Clauses:
SELECT,FROM,INNER JOIN ... ON,WHERE,GROUP BY,HAVING,ORDER BY(multi-key,ASC/DESC, by output alias, by an aggregate that isn't selected, or by 1-based output ordinal),LIMIT. - Aggregates:
COUNT(*),COUNT(x),SUM,AVG,MIN,MAX, with or withoutGROUP BY(no group key = a single global-aggregate row). - Expressions: arithmetic (
+ - * /), comparisons (= <> < <= > >=), booleanAND/OR/NOT, parentheses, column qualifiers (e.salary), literals (int/float/string/NULL/TRUE/FALSE), aliases (AS). - SQLite-compatible semantics where it matters for the oracle: integer ÷
integer truncates toward zero, division by zero is
NULL, comparisons withNULLyieldNULL, three-valued (Kleene) logic forAND/OR/NOT(NOT NULLisNULL;TRUE OR NULLisTRUE;FALSE AND NULLisFALSE),SUMof zero rows isNULL,AVGreturns a float,MIN/MAXorder numbers before text (and never crash on a mixed-type column), andORDER BYputsNULLs first (ASC) / last (DESC).
queryd is a read-only analytical engine. It does not implement: writes /
DDL / transactions, indexes or a cost-based optimizer (only rule-based predicate
pushdown), subqueries, window functions, UNION, DISTINCT, BETWEEN/IN/
LIKE, SQLite's text↔number affinity coercion (arithmetic on a text value
raises a clean ExecError rather than coercing), or GROUP BY ordinals.
Everything listed under "Supported SQL" is implemented and tested against
SQLite.
queryd/
├── tokens.py TokenType enum, Token dataclass, KEYWORDS
├── lexer.py source string -> [Token] (single O(n) scan)
├── ast.py immutable AST node dataclasses
├── parser.py [Token] -> Select AST (recursive descent + Pratt)
├── catalog.py Schema / ColumnInfo + CSV type inference + positional resolve
├── storage.py TableSource ABC; CSVTable (streams), MemoryTable
├── expr.py compile_expr(node, schema) -> closure(row) -> value
├── plan.py AST -> logical plan + predicate-pushdown optimizer + explain()
├── operators.py Volcano operators (Scan/Filter/Project/HashJoin/Aggregate/Sort/Limit)
├── executor.py logical plan -> physical operator tree; drives the root iterator
├── engine.py Database facade + Result (pretty/as_dicts) + run_sql()
└── cli.py REPL, meta-commands, boxed output, main()
Volcano executor. Every operator is a Python iterator that pulls rows
lazily from its child(ren). Lazy operators (Scan, Filter, Project,
Limit) stream; blocking operators (HashAggregate, Sort, and the build side
of HashJoin) materialize only what they must. LIMIT short-circuits upstream.
Hash join. Equi-joins (ON a = b) build a hash table on one side and probe
with the other in O(n+m); non-equi predicates fall back to a nested loop.
Expression compilation. Instead of walking the AST per row, compile_expr
returns a closure with column indices resolved once at plan time — the hot path
does no name lookups.
Rule-based optimizer. A WHERE conjunct that references only one base table
is pushed down to sit directly above that table's Scan, shrinking join inputs.
Run .explain to see it.
tests/test_oracle.py is the headline guarantee:
- Load the demo CSVs into an in-memory
sqlite3database with the same inferred column types, and into aqueryd.Database. - Run the identical SQL string through both engines.
- Normalize (round floats; map
True/Falseto1/0) and compare: for queries withoutORDER BY, as a multiset; for ordered queries, theORDER BYkey sequence must match exactly and the full result must match as a multiset (so unspecified tie order is tolerated, like SQL itself). - A fuzz generator composes 200 random-but-valid queries (random projections, predicates, group-bys, aggregates, ordering, limits) and runs the same differential check, so coverage isn't limited to hand-picked cases.
If queryd and SQLite ever disagree, the test fails with a diff dump.
queryd> .explain SELECT e.name, d.name FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 90000 AND d.location = 'Austin';
Project([name, name])
HashJoin(on e.dept_id = d.id)
Filter(e.salary > 90000)
Scan(employees AS e)
Filter(d.location = 'Austin')
Scan(departments AS d)
Each single-table WHERE conjunct sinks to a Filter directly above its
Scan, so the join sees fewer rows.
MIT © Vansh Singh