
> **Xorq is a multi‑engine batch transformation framework built on Ibis,
> DataFusion and Arrow.**
> It ships a multi-engine manifest that you can run in SQL across DuckDB,
> Snowflake, DataFusion, and more.

---

## What Xorq gives you

| Feature | Description |
|---|---|
|**Multi-engine manifest** | A single, typed plan (YAML manifest) that executes as SQL on DuckDB, Snowflake, and embedded DataFusion. |
|**Deterministic builds & caching** | One hash for everything—computed from **expression inputs**; for YAML-only builds, we hash the **expression**. The hash names `builds/<hash>/` and keys the cache. |
|**Lineage & schemas** | Compile-time schema checks with end-to-end, column-level lineage. |
|**Compute catalog** | Versioned registry to run, cache, diff, and serve-unbound manifests. |
|**Portable UDxFs** | Arbitrary Python logic with schema-in/out contracts, portable via Arrow Flight. |
|**`scikit-learn` integration** | Fit/predict pipelines serialize to a manifest for portable batch scoring with training lineage. |
|**Templates with `uv`** | `xorq init` ships starter templates with **replicable environments**: no “works on my machine.” |

> [!NOTE]
> **Not an orchestrator.** Use Xorq from Airflow, Dagster, GitHub Actions, etc.

> **Batch focus.** Not streaming/online—**batch**, **out-of-core** transformations.

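To make the hash naming concrete, here is a small sketch of the build-and-replay loop; treat the exact `xorq build` / `xorq run` invocations as illustrative, since flags may differ across versions:

```bash
# Build the expression defined in expr.py; the output directory is named by the hash
xorq build expr.py
# -> builds/7061dd65ff3c/  (manifest YAML plus supporting files)

# Replaying the build reuses the cache keyed by that hash
xorq run builds/7061dd65ff3c
```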

### Supported backends

- DuckDB

> **Review comment:** What is the logic for the order of this list? If there is no logic, then I suggest alphabetizing it.

- Snowflake
- BigQuery
- Postgres
- SQLite
- DataFusion (vanilla)
- Xorq-DataFusion (embedded)


## Quickstart
Then follow the [Quickstart
Tutorial](https://docs.xorq.dev/tutorials/getting_started/quickstart) for a
full walk-through using the Penguins dataset.
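
If you are starting from scratch, installation is typically a one-liner (assuming the package is published on PyPI as `xorq`):

```bash
# Install into the current environment
pip install xorq

# or, with uv
uv pip install xorq
```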

### Project Templates

We ship minimal, opinionated starter templates so you can go from zero to a
manifest fast.

> **Review comment:** Who is "we"?

- **Penguins:** Feature engineering + fit/predict LogisticRegression on the
Penguins dataset.
- **Digits:** Fit/predict on the Digits dataset with a full pipeline (PCA +
classifier).

Each template includes:

```text
uv.lock          — pinned dependencies for replicable envs
requirements.txt — bare minimal requirements
pyproject.toml   — project metadata
expr.py          — the expr entrypoint
```
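
A minimal way to scaffold and build one of these templates might look like the following; the template-selection flag and directory name are assumptions, so check `xorq init --help` for the actual interface:

```bash
# Scaffold a project from the Penguins starter template (flag name assumed)
xorq init --template penguins
cd penguins  # directory name assumed

# Build the expression entrypoint into a hash-named manifest directory
xorq build expr.py
```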

#### Requirements for environment replicability for a Project

- TBD

> **Review comment:** If this is empty, why is the section here? Put it in a GitHub issue to add later, remove this.

## Multi-engine manifest for Machine Learning pipelines

The manifest is a collection of YAML files that captures the expression graph,
plus supporting files such as memtables serialized to disk.

Once you `xorq build` your pipeline, you get a hash-named `builds/<hash>/`
directory containing the manifest and its supporting files.

Xorq makes it easy to bring your scikit-learn `Pipeline` and automatically
convert it into a deferred Xorq expression.

**Engines used**: `duckdb` to read parquet, `datafusion` for running UDFs.


```python
import xorq.api as xo
from xorq.expr.ml.pipeline_lib import Pipeline
```

```yaml
predicted:
  body_mass_g: ...
  species: ... # target
```
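
For orientation, here is a minimal sketch of the conversion step. Apart from the `Pipeline` import shown above, the Xorq-specific calls (`Pipeline.from_instance` and the fit/predict usage in the comments) are assumptions, so check the quickstart for the exact API:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import StandardScaler

from xorq.expr.ml.pipeline_lib import Pipeline

# Plain scikit-learn pipeline: scale features, then classify
sk_pipeline = SkPipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

# Wrap it as a deferred Xorq pipeline (constructor name assumed)
xorq_pipeline = Pipeline.from_instance(sk_pipeline)

# `train` and `test` would be Xorq table expressions (for example, parquet read
# through DuckDB); fitting and predicting stay deferred until you build/execute:
# fitted = xorq_pipeline.fit(train, features=features, target="species")
# expr = fitted.predict(test)
```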

We serialize the expression as a YAML manifest that captures the graph and all
nodes (including UDFs as pickled entries); builds are content-addressed by the
expression hash.

> **Review comment:** Who is "we"?

> **Reviewer:** We are input-addressed, not content-addressed.
>
> **Author:** Are build hashes also input-addressed?
>
> **Reviewer:** Yes.

This ensures expression-level replicability and round-trippability to Python.

## From manifest to catalog

```bash
xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir penguins_example b2370a29c19df8e1e639c63252dacd0e
```

- `--cache-dir penguins_example`: Directory for caching results
- `b2370a29c19df8e1e639c63252dacd0e`: The node-hash that represents the expression input to replace

To learn more about how to find the node hash, see the [`serve-unbound`](https://docs.xorq.dev/tutorials/getting_started/quickstart#finding-the-node-hash) documentation.

### Compose with the served expression

```python
# `expr` and `f` come from the setup above; `f` invokes the served expression
new_expr = expr.pipe(f)
new_expr.execute()
```

### Replicable environments with `uv`

**Using the lock with Xorq**

If a `uv.lock` is present, Xorq can use it directly:

```bash
# Build using a locked env (hydrates if needed)
xorq uv-build

# Run a build with the locked env
xorq uv-run builds/<hash>
```

> **Review comment (Collaborator):** We currently use `requirements.txt` to build the uv env, not `uv.lock`.

## How Xorq works

Xorq uses Apache Arrow Flight RPC for zero-copy data transfer and leverages Ibis and
Expand All @@ -161,18 +212,18 @@ DataFusion under the hood for efficient computation.

A generic catalog that can be used to build new workloads:

- ML/data pipeline development (deterministic builds, caching, replicable envs)
- Lineage‑preserving, multi-engine feature stores (offline, reproducible)
- Composable data products (ship datasets as compute artifacts)
- Governed sharing of compute (catalog entries as the contract between teams)


Also great for:

- Generating SQL from high-level DSLs (e.g. Semantic Layers)

> **Review comment:** Avoid using e.g. and i.e.

- Batch model scoring across engines (same expr, different backends)
- Cross‑warehouse migrations (portability via Ibis + UDxFs)
- Data CI (compile‑time schema/lineage checks in PRs)
- ML Experiment Tracking (versioned manifests with cached results)


## Learn More