15 changes: 10 additions & 5 deletions README.md
@@ -5,13 +5,16 @@
[![Docs](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://deepseek-ai.github.io/smallpond/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A lightweight data processing framework built on [DuckDB] and [3FS].
A lightweight **distributed** data processing framework built on [DuckDB] and [Ray], with [3FS] integration for high-performance storage.

## Features

- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🔄 Distributed execution through [Ray] for parallel processing of TB-scale data
- 🛠️ Easy operations with no long-running services
- 💾 Storage support for local filesystem and [3FS]
- 📈 Flexible partitioning strategies (hash, even, random)
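
The hash strategy in the partitioning list above can be illustrated with a small stdlib-only sketch (a toy model of the idea; `partition_for` is a hypothetical helper, not smallpond's implementation):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a stable partition index (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Rows with the same ticker always land in the same partition, so a
# per-partition GROUP BY ticker sees every row for that ticker.
rows = [("AAPL", 101.2), ("MSFT", 55.0), ("AAPL", 99.8)]
partitions: dict[int, list] = {}
for ticker, price in rows:
    partitions.setdefault(partition_for(ticker, 3), []).append((ticker, price))
```

Because the assignment depends only on the key, the same co-location property is what makes `repartition(3, hash_by="ticker")` safe to combine with a per-partition `GROUP BY ticker`.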

## Installation

@@ -31,13 +34,13 @@ wget https://duckdb.org/data/prices.parquet
```python
import smallpond

# Initialize session
# Initialize session (automatically starts a local Ray cluster)
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
# Process data with partitioning for distributed execution
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

@@ -50,8 +53,9 @@ print(df.to_pandas())
## Documentation

For detailed guides and API reference:
- [Getting Started](docs/source/getstarted.rst)
- [API Reference](docs/source/api.rst)
- [Getting Started](https://deepseek-ai.github.io/smallpond/getstarted.html)
- [API Reference](https://deepseek-ai.github.io/smallpond/api.html)
- [Architecture](https://deepseek-ai.github.io/smallpond/architecture.html)

## Performance

@@ -60,6 +64,7 @@ We evaluated smallpond using the [GraySort benchmark] ([script]) on a cluster co
Details can be found in [3FS - Gray Sort].

[DuckDB]: https://duckdb.org/
[Ray]: https://ray.io/
[3FS]: https://github.com/deepseek-ai/3FS
[GraySort benchmark]: https://sortbenchmark.org/
[script]: benchmarks/gray_sort_benchmark.py
93 changes: 93 additions & 0 deletions docs/source/architecture.rst
@@ -0,0 +1,93 @@
Architecture
============

smallpond uses a DAG-based execution model with lazy evaluation:

1. Operations build a logical plan as a directed acyclic graph (DAG)
2. Execution is triggered only when an action is called (write, compute, etc.)
3. Ray distributes tasks across workers, with each worker running its own DuckDB instance
4. 3FS is the primary storage backend; the local filesystem can also be used for development and testing
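
The lazy-evaluation model above can be sketched as a toy DAG in plain Python (an illustration of the execution model only; these are not smallpond's actual classes):

```python
class Node:
    """One operation in the logical plan; nothing runs at construction time."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def compute(self):
        """Action: walk the DAG and execute each node's function."""
        inputs = [p.compute() for p in self.parents]
        return self.fn(*inputs)

# Building the plan is cheap: only Node objects are created.
source = Node(lambda: [3, 1, 2])
mapped = Node(lambda xs: [x * 10 for x in xs], source)
result = Node(lambda xs: sorted(xs), mapped)

# Execution happens only now, when the action is called.
print(result.compute())  # prints [10, 20, 30]
```

In smallpond the same trigger role is played by actions such as ``write_parquet`` and ``compute``, with Ray scheduling the node executions across workers instead of a recursive in-process walk.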

.. mermaid::
:align: center
:caption: smallpond Architecture

flowchart TD
%% Main Flow subgraph at the top
subgraph MF[Main Flow]
direction TB
User([User]):::userFlow
Code[User Code]:::userFlow
DAG[Logical Plan DAG]:::userFlow
Partitions[(Partitioned Data)]

User --> |"1 Creates DataFrame<br>operations"| Code
Code --> |"2 Builds"| DAG
DAG --> |"3 Optimizes & manually<br>partitions data"| Partitions
User --> |"6 Triggers execution<br>(write_parquet, compute)"| DAG
end

%% Distributed Execution subgraph
subgraph DE[Distributed Execution]
direction TB
RayCluster[Ray Cluster]:::execution

%% Workers level
Worker1[Worker 1]:::execution
Worker2[Worker 2]:::execution
Worker3[Worker 3]:::execution

%% DuckDB level
DuckDB1[DuckDB Instance]:::execution
DuckDB2[DuckDB Instance]:::execution
DuckDB3[DuckDB Instance]:::execution

%% Internal connections
RayCluster --> Worker1
RayCluster --> Worker2
RayCluster --> Worker3

Worker1 --> |"5a Processes<br>partition"| DuckDB1
Worker2 --> |"5b Processes<br>partition"| DuckDB2
Worker3 --> |"5c Processes<br>partition"| DuckDB3
end

%% Bottom row with Storage and Results side by side
subgraph Bottom[ ]
direction LR
subgraph SO[Storage Options]
direction LR
Storage[(Storage Layer)]:::storage
ThreeFS[3FS]:::storage
LocalFS[Local FS]:::storage

Storage --> ThreeFS
Storage --> LocalFS
end

subgraph RC[Results Collection]
direction LR
Results[Results Collection]:::userFlow
end
end

%% Connect subgraphs
Partitions --> |"4 Distributes tasks<br>via Ray"| RayCluster
Partitions -.-> Storage

%% Storage connections with dotted lines
DuckDB1 -.-> |"Reads/Writes"| Storage
DuckDB2 -.-> |"Reads/Writes"| Storage
DuckDB3 -.-> |"Reads/Writes"| Storage

%% Results collection
DuckDB1 --> |"7a Produces"| Results
DuckDB2 --> |"7b Produces"| Results
DuckDB3 --> |"7c Produces"| Results
Results --> |"8 Returns"| User

%% Styling
classDef userFlow fill:#ff69b4,stroke:#333333,stroke-width:2px,color:#ffffff
classDef execution fill:#4CAF50,stroke:#333333,stroke-width:1px,color:#ffffff
classDef storage fill:#2196F3,stroke:#333333,stroke-width:1px,color:#ffffff
classDef default color:#ffffff
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -16,6 +16,7 @@
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinxcontrib.mermaid",
]

templates_path = ["_templates"]
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -19,6 +19,7 @@ Why smallpond?
:maxdepth: 1

getstarted
architecture
internals

.. toctree::
1 change: 1 addition & 0 deletions pyproject.toml
@@ -59,6 +59,7 @@ dev = [
docs = [
"sphinx==7.1.2",
"pydata-sphinx-theme==0.14.4",
"sphinxcontrib-mermaid==1.0.0",
]
warc = [
"warcio >= 1.7.4",