Skip to content

Vansh4195/queryd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

queryd

A from-scratch SQL query engine and mini relational database written in pure Python (standard library only). queryd lexes and parses real SQL, builds a logical query plan, optimizes it, and executes it with a Volcano-style iterator pipeline over CSV and in-memory tables. It ships a CLI REPL and — most importantly — a differential test harness that grades every query against sqlite3, the most battle-tested SQL engine in the standard library.

No external dependencies. No pip install. Runs on any CPython 3.10+.

SQL text ─▶ Lexer ─▶ Parser ─▶ AST ─▶ Logical Plan ─▶ Optimizer ─▶ Volcano Executor ─▶ rows
                                            │
                                            └── predicate pushdown (rule-based)

Why it's interesting

  • It actually runs SQL. Not a toy that handles one query shape — it parses and executes a real subset of SQL and returns correct results.
  • Provable correctness. tests/test_oracle.py loads the same CSV data into both queryd and sqlite3, runs the same SQL through both, and asserts the result sets are identical — across hand-written showcase queries and a 200-case randomized fuzz battery. SQLite is the oracle.
  • Textbook architecture. Lexer → recursive-descent + Pratt parser → AST → logical plan → physical Volcano operators. The code reads like a teaching reference.
  • Zero dependencies, zero install. The engine, demo, CLI, and the test suite all run on a clean Python with nothing pip-installed.

Quick start

git clone https://github.com/Vansh4195/queryd
cd queryd

# 1) Run the demo (10 showcase queries over real CSV data)
python demo/demo.py

# 2) Run the test suite — including the sqlite3 oracle + fuzzer.
#    Zero-dependency runner (no pytest needed):
python run_tests.py
#    ...or, if you have pytest:
python -m pytest

# 3) Open the SQL REPL over the demo CSVs
python -m queryd demo/data/employees.csv demo/data/departments.csv demo/data/sales.csv

Embed it in Python

from queryd import Database

db = Database()
db.register_csv("employees", "demo/data/employees.csv")
db.register_csv("departments", "demo/data/departments.csv")

result = db.sql("""
    SELECT d.name AS dept, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
    FROM employees e JOIN departments d ON e.dept_id = d.id
    WHERE e.salary > 50000
    GROUP BY d.name
    HAVING COUNT(*) >= 2
    ORDER BY avg_salary DESC
    LIMIT 5
""")

print(result.pretty())
for row in result.as_dicts():
    print(row)

Ad-hoc, with rows passed inline as dicts:

from queryd import run_sql
r = run_sql("SELECT a FROM t WHERE a > 1 ORDER BY a",
            t=[{"a": 1}, {"a": 2}, {"a": 3}])
print(r.rows)   # [(2,), (3,)]

CLI

$ python -m queryd demo/data/employees.csv demo/data/departments.csv
queryd 0.1.0 — type .help for commands, .quit to exit
loaded tables: departments, employees

queryd> SELECT city, COUNT(*) AS n FROM employees GROUP BY city ORDER BY n DESC;
+----------+---+
| city     | n |
+----------+---+
| Austin   | 9 |
| New York | 7 |
| Chicago  | 4 |
+----------+---+
(3 rows)

Meta-commands:

command description
.tables list registered tables
.schema <table> show a table's columns + inferred types
.explain <sql> print the logical plan tree for a query
.read <file.sql> run all statements in a .sql file
.help show help
.quit / .exit leave the REPL

One-shot execution: python -m queryd demo/data/*.csv -c "SELECT ...".


Supported SQL

select      ::= "SELECT" proj_list "FROM" table_ref join* where?
                group_by? having? order_by? limit?
proj_list   ::= proj ("," proj)*
proj        ::= "*" | ident "." "*" | expr ("AS"? ident)?
table_ref   ::= ident ("AS"? ident)?
join        ::= "INNER"? "JOIN" table_ref "ON" expr
where       ::= "WHERE" expr
group_by    ::= "GROUP" "BY" expr ("," expr)*
having      ::= "HAVING" expr
order_by    ::= "ORDER" "BY" order_key ("," order_key)*
order_key   ::= expr ("ASC" | "DESC")?
limit       ::= "LIMIT" integer

expr        ::= or_expr            (* Pratt / precedence climbing *)
or_expr     ::= and_expr ("OR" and_expr)*
and_expr    ::= not_expr ("AND" not_expr)*
not_expr    ::= "NOT" not_expr | cmp_expr
cmp_expr    ::= add_expr (("=" | "<>" | "<" | "<=" | ">" | ">=") add_expr)*
add_expr    ::= mul_expr (("+" | "-") mul_expr)*
mul_expr    ::= unary (("*" | "/") unary)*
unary       ::= "-" unary | primary
primary     ::= number | string | "NULL" | "TRUE" | "FALSE"
              | column_ref | func_call | "(" expr ")"
column_ref  ::= ident ("." ident)?
func_call   ::= agg "(" (expr | "*") ")"
agg         ::= "COUNT" | "SUM" | "AVG" | "MIN" | "MAX"
  • Clauses: SELECT, FROM, INNER JOIN ... ON, WHERE, GROUP BY, HAVING, ORDER BY (multi-key, ASC/DESC, by output alias, by an aggregate that isn't selected, or by 1-based output ordinal), LIMIT.
  • Aggregates: COUNT(*), COUNT(x), SUM, AVG, MIN, MAX, with or without GROUP BY (no group key = a single global-aggregate row).
  • Expressions: arithmetic (+ - * /), comparisons (= <> < <= > >=), boolean AND / OR / NOT, parentheses, column qualifiers (e.salary), literals (int/float/string/NULL/TRUE/FALSE), aliases (AS).
  • SQLite-compatible semantics where it matters for the oracle: integer ÷ integer truncates toward zero, division by zero is NULL, comparisons with NULL yield NULL, three-valued (Kleene) logic for AND/OR/NOT (NOT NULL is NULL; TRUE OR NULL is TRUE; FALSE AND NULL is FALSE), SUM of zero rows is NULL, AVG returns a float, MIN/MAX order numbers before text (and never crash on a mixed-type column), and ORDER BY puts NULLs first (ASC) / last (DESC).

Non-goals (honest boundaries)

queryd is a read-only analytical engine. It does not implement: writes / DDL / transactions, indexes or a cost-based optimizer (only rule-based predicate pushdown), subqueries, window functions, UNION, DISTINCT, BETWEEN/IN/ LIKE, SQLite's text↔number affinity coercion (arithmetic on a text value raises a clean ExecError rather than coercing), or GROUP BY ordinals. Everything listed under "Supported SQL" is implemented and tested against SQLite.


Architecture

queryd/
├── tokens.py      TokenType enum, Token dataclass, KEYWORDS
├── lexer.py       source string  -> [Token]            (single O(n) scan)
├── ast.py         immutable AST node dataclasses
├── parser.py      [Token] -> Select AST                (recursive descent + Pratt)
├── catalog.py     Schema / ColumnInfo + CSV type inference + positional resolve
├── storage.py     TableSource ABC; CSVTable (streams), MemoryTable
├── expr.py        compile_expr(node, schema) -> closure(row) -> value
├── plan.py        AST -> logical plan + predicate-pushdown optimizer + explain()
├── operators.py   Volcano operators (Scan/Filter/Project/HashJoin/Aggregate/Sort/Limit)
├── executor.py    logical plan -> physical operator tree; drives the root iterator
├── engine.py      Database facade + Result (pretty/as_dicts) + run_sql()
└── cli.py         REPL, meta-commands, boxed output, main()

Volcano executor. Every operator is a Python iterator that pulls rows lazily from its child(ren). Lazy operators (Scan, Filter, Project, Limit) stream; blocking operators (HashAggregate, Sort, and the build side of HashJoin) materialize only what they must. LIMIT short-circuits upstream.

Hash join. Equi-joins (ON a = b) build a hash table on one side and probe with the other in O(n+m); non-equi predicates fall back to a nested loop.

Expression compilation. Instead of walking the AST per row, compile_expr returns a closure with column indices resolved once at plan time — the hot path does no name lookups.

Rule-based optimizer. A WHERE conjunct that references only one base table is pushed down to sit directly above that table's Scan, shrinking join inputs. Run .explain to see it.


The oracle (proof of correctness)

tests/test_oracle.py is the headline guarantee:

  1. Load the demo CSVs into an in-memory sqlite3 database with the same inferred column types, and into a queryd.Database.
  2. Run the identical SQL string through both engines.
  3. Normalize (round floats; map True/False to 1/0) and compare: for queries without ORDER BY, as a multiset; for ordered queries, the ORDER BY key sequence must match exactly and the full result must match as a multiset (so unspecified tie order is tolerated, like SQL itself).
  4. A fuzz generator composes 200 random-but-valid queries (random projections, predicates, group-bys, aggregates, ordering, limits) and runs the same differential check, so coverage isn't limited to hand-picked cases.

If queryd and SQLite ever disagree, the test fails with a diff dump.


Example: predicate pushdown via EXPLAIN

queryd> .explain SELECT e.name, d.name FROM employees e
        JOIN departments d ON e.dept_id = d.id
        WHERE e.salary > 90000 AND d.location = 'Austin';

Project([name, name])
  HashJoin(on e.dept_id = d.id)
    Filter(e.salary > 90000)
      Scan(employees AS e)
    Filter(d.location = 'Austin')
      Scan(departments AS d)

Each single-table WHERE conjunct sinks to a Filter directly above its Scan, so the join sees fewer rows.


License

MIT © Vansh Singh

About

A from-scratch SQL query engine and mini relational database in pure Python (stdlib only): lexer, recursive-descent + Pratt parser, logical plan with predicate pushdown, and a Volcano executor — graded for correctness against sqlite3.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages