docs: update README #1286
Changes from 6 commits
@@ -11,30 +11,38 @@

> **Xorq is a multi‑engine batch transformation framework built on Ibis,
> DataFusion and Arrow.**
> It ships a compute catalog and a multi-engine manifest you can run
> across DuckDB, Snowflake, DataFusion, and more.
> It ships a multi-engine manifest that you can run in SQL across DuckDB,
> Snowflake, DataFusion, and more.

---

## What Xorq gives you

- **Multi-engine manifest:** A single, typed plan captured as a YAML artifact
  that can execute in DuckDB, Snowflake, DataFusion, etc.
- **Deterministic builds & caching:** Content hashes of the plan power
  reproducible runs and cheap replays.
- **Lineage & Schemas:** Compile-time schema checks and end-to-end,
  column-level lineage.
- **Compute catalog:** Versioned registry that stores and operates on manifests
  (run, cache, diff, serve-unbound).
- **Portable UDxFs:** Arbitrary Python logic with schema-in/out contracts,
  portable via Arrow Flight.
- **Scikit-learn integration:** Model fitting pipeline captured in the predict
  pipeline manifest for portable batch scoring and model training lineage.

| Feature | Description |
|---|---|
| **Multi-engine manifest** | A single, typed plan (YAML manifest) that executes as SQL on DuckDB, Snowflake, and embedded DataFusion. |
| **Deterministic builds & caching** | One hash for everything—computed from **expression inputs**; for YAML-only builds, we hash the **expression**. The hash names `builds/<hash>/` and keys the cache. |
| **Lineage & schemas** | Compile-time schema checks with end-to-end, column-level lineage. |
| **Compute catalog** | Versioned registry to run, cache, diff, and serve-unbound manifests. |
| **Portable UDxFs** | Arbitrary Python logic with schema-in/out contracts, portable via Arrow Flight. |
| **`scikit-learn` integration** | Fit/predict pipelines serialize to a manifest for portable batch scoring with training lineage. |
| **Templates with `uv`** | `xorq init` ships templates in **replicable environments**—no “works on my machine.” |

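To make the manifest idea concrete, the sketch below shows one deferred expression handed between engines before anything executes. The backend helpers (`xo.duckdb.connect`, `xo.connect`) and the `into_backend` handoff are assumptions for illustration and may not match the exact Xorq API.

```python
# Minimal sketch; backend helpers and `into_backend` are assumed, not verbatim API.
import xorq.api as xo

duck = xo.duckdb.connect()   # assumed: DuckDB backend handle
local = xo.connect()         # assumed: embedded Xorq-DataFusion backend

# Read with DuckDB, then hand the still-deferred table to the embedded engine.
penguins = duck.read_parquet("penguins.parquet", table_name="penguins")
moved = penguins.filter(penguins.body_mass_g.notnull()).into_backend(local)

expr = moved.group_by("species").agg(avg_mass=moved.body_mass_g.mean())
df = expr.execute()          # nothing runs until .execute()
```

The point is that the plan stays deferred and typed until `.execute()`, which is what the YAML manifest captures.
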
> [!NOTE]
> **Not an orchestrator.** Use Xorq from Airflow, Dagster, GitHub Actions, etc.

> **Not streaming/online.** Xorq focuses on **batch**, **out-of-core**
> transformations.
> **Batch focus.** Not streaming/online—**batch**, **out-of-core** transformations.

### Supported backends

- DuckDB
- Snowflake
- BigQuery
- Postgres
- SQLite
- DataFusion (vanilla)
- Xorq-DataFusion (embedded)

**Review comment:** What is the logic for the order of this list? If there is no logic, then I suggest alphabetizing it.

## Quickstart

@@ -48,7 +56,29 @@
Then follow the [Quickstart
Tutorial](https://docs.xorq.dev/tutorials/getting_started/quickstart) for a
full walk-through using the Penguins dataset.

## From `scikit-learn` to multi-engine manifest

### Project Templates

We ship minimal, opinionated starter templates so you can go from
zero-to-manifest fast.

**Review comment:** Who is "we"?

- **Penguins:** Feature engineering + fit/predict LogisticRegression on the
  Penguins dataset.
- **Digits:** Fit/predict on the Digits dataset with a full pipeline (PCA +
  classifier).

Each template includes:

```bash
uv.lock           — pinned dependencies for replicable envs
requirements.txt  — bare minimal requirements
pyproject.toml    — project metadata
expr.py           — the expr entrypoint
```

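For orientation, here is a guess at what a minimal `expr.py` entrypoint could contain; the convention that the module exposes a top-level `expr` variable and the `read_parquet` helper are assumptions, not something this README specifies.

```python
# expr.py -- hypothetical minimal entrypoint (names and conventions assumed).
import xorq.api as xo

con = xo.connect()                                # assumed: embedded backend
penguins = con.read_parquet("penguins.parquet")   # assumed reader helper

# The deferred expression that the template's build tooling would pick up.
expr = penguins.group_by("species").agg(n=penguins.count())
```
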
#### Requirements for environment replicability for a project

- TBD

**Review comment:** If this is empty, why is the section here? Put it in a GitHub issue to add later; remove this.

## Multi-engine manifest for Machine Learning pipelines

The manifest is a collection of YAML files that captures the expression graph
and supporting files, like memtables serialized to disk.

@@ -62,6 +92,9 @@ Once you xorq build your pipeline, you get:
Xorq makes it easy to bring your scikit-learn Pipeline and automatically
converts it into a deferred Xorq expression.

**Engines used:** `duckdb` to read parquet, `datafusion` for running UDFs.

```python
import xorq.api as xo
from xorq.expr.ml.pipeline_lib import Pipeline

# (the rest of this example falls outside the diff hunk)
```

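Since the rest of the example sits outside this hunk, here is a rough sketch of how a scikit-learn pipeline might be wrapped. `Pipeline.from_instance` and the surrounding setup are assumptions for illustration rather than the documented API.

```python
# Rough sketch; the Pipeline.from_instance entrypoint is an assumption.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import StandardScaler

from xorq.expr.ml.pipeline_lib import Pipeline

# A plain scikit-learn pipeline, defined as usual.
sk_pipeline = SkPipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression())]
)

# Wrap it as a deferred Xorq pipeline; fitting and prediction would then be
# expressed against deferred tables, so the fit/predict graph lands in the manifest.
xorq_pipeline = Pipeline.from_instance(sk_pipeline)
```
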
@@ -96,8 +129,12 @@ predicted:
  body_mass_g: ...
  species: ... # target

The YAML format serializes the Expression graph and all its nodes, including
UDFs as pickled entries.

We serialize the expression as a YAML manifest that captures the graph and all
nodes (including UDFs as pickled entries); builds are content-addressed by the
expression hash.

**Review comment:** Who is "we"?

This ensures expression-level replicability and round-trippability to Python.

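As a conceptual illustration of content-addressed builds (this is not Xorq's actual hashing scheme; the digest length and layout below are made up), the idea is that identical expression inputs always map to the same `builds/<hash>/` directory:

```python
# Conceptual illustration only -- not Xorq's real hashing code.
import hashlib
from pathlib import Path

def build_dir_for(manifest_yaml: str, root: str = "builds") -> Path:
    """Map serialized expression text to a stable, content-addressed directory."""
    digest = hashlib.sha256(manifest_yaml.encode("utf-8")).hexdigest()[:12]
    return Path(root) / digest

# The same manifest text always yields the same build directory,
# which is what makes replays and cache hits cheap.
assert build_dir_for("predicted: {species: str}") == build_dir_for("predicted: {species: str}")
```
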
## From manifest to catalog

@@ -135,7 +172,7 @@ xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir
- `--cache-dir penguins_example`: Directory for caching results
- `b2370a29c19df8e1e639c63252dacd0e`: The node hash that represents the expression input to replace

To learn more about how to find the node hash, check out the [Serve Unbound](https://docs.xorq.dev/tutorials/getting_started/quickstart#finding-the-node-hash).
To learn more about how to find the node hash, check out the [`serve-unbound`](https://docs.xorq.dev/tutorials/getting_started/quickstart#finding-the-node-hash) documentation.

### Compose with the served expression

@@ -150,6 +187,20 @@ new_expr = expr.pipe(f)
new_expr.execute()

### Replicable environments with uv

Using the lock with Xorq: if a `uv.lock` is present, Xorq can use it directly:

```bash
# Build using a locked env (hydrates if needed)
xorq uv-build

# Run a build with the locked env
xorq uv-run builds/<hash>
```

## How Xorq works

Xorq uses Apache Arrow Flight RPC for zero-copy data transfer and leverages Ibis
and DataFusion under the hood for efficient computation.

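To ground the Arrow Flight piece, the snippet below is a generic PyArrow Flight fetch. It illustrates the transfer mechanism only; the endpoint and ticket contents are placeholders, not Xorq's actual wire protocol.

```python
# Generic Arrow Flight fetch (illustrative only; not Xorq's actual protocol).
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8001")        # placeholder endpoint
reader = client.do_get(flight.Ticket(b"placeholder"))   # placeholder ticket
table = reader.read_all()                               # an Arrow Table, streamed as record batches
print(table.schema)
```
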
@@ -161,18 +212,18 @@

A generic catalog that can be used to build new workloads:

- ML/data pipeline development (deterministic builds, caching, replicable envs)
- Lineage‑preserving, multi-engine feature stores (offline, reproducible)
- Composable data products (ship datasets as compute artifacts)
- Governed sharing of compute (catalog entries as the contract between teams)
- ML/data pipeline development (deterministic builds)

Also great for:

- Generating SQL from high-level DSLs (e.g. Semantic Layers)
- Batch model scoring across engines (same expr, different backends)
- Cross‑warehouse migrations (portability via Ibis + UDxFs)
- Data CI (compile‑time schema/lineage checks in PRs)
- ML Experiment Tracking (versioned manifests with cached results)

**Review comment:** Avoid using e.g. and i.e.

## Learn More
