15 changes: 10 additions & 5 deletions README.md
@@ -5,13 +5,16 @@
[![Docs](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://deepseek-ai.github.io/smallpond/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A lightweight data processing framework built on [DuckDB] and [3FS].
A lightweight **distributed** data processing framework built on [DuckDB] and [Ray], with [3FS] integration for high-performance storage.

## Features

- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🔄 Distributed execution through [Ray] for parallel processing of TB-scale data
- 🛠️ Easy operations with no long-running services
- 💾 Storage support for local filesystem and [3FS]
- 📈 Flexible partitioning strategies (hash, even, random)
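
The hash strategy in the partitioning list above can be illustrated with a small stdlib-only sketch (a toy model of the idea; `partition_for` is a hypothetical helper, not smallpond's implementation):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a stable partition index (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Rows with the same ticker always land in the same partition, so a
# per-partition GROUP BY ticker sees every row for that ticker.
rows = [("AAPL", 101.2), ("MSFT", 55.0), ("AAPL", 99.8)]
partitions: dict[int, list] = {}
for ticker, price in rows:
    partitions.setdefault(partition_for(ticker, 3), []).append((ticker, price))
```

Because the assignment depends only on the key, the same co-location property is what makes `repartition(3, hash_by="ticker")` safe to combine with a per-partition `GROUP BY ticker`.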

## Installation

@@ -31,13 +34,13 @@ wget https://duckdb.org/data/prices.parquet
```python
import smallpond

# Initialize session
# Initialize session (automatically starts a local Ray cluster)
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
# Process data with partitioning for distributed execution
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

@@ -50,8 +53,9 @@ print(df.to_pandas())
## Documentation

For detailed guides and API reference:
- [Getting Started](docs/source/getstarted.rst)
- [API Reference](docs/source/api.rst)
- [Getting Started](https://deepseek-ai.github.io/smallpond/getstarted.html)
- [API Reference](https://deepseek-ai.github.io/smallpond/api.html)
- [Architecture](https://deepseek-ai.github.io/smallpond/architecture.html)

## Performance

@@ -60,6 +64,7 @@ We evaluated smallpond using the [GraySort benchmark] ([script]) on a cluster co
Details can be found in [3FS - Gray Sort].

[DuckDB]: https://duckdb.org/
[Ray]: https://ray.io/
[3FS]: https://github.com/deepseek-ai/3FS
[GraySort benchmark]: https://sortbenchmark.org/
[script]: benchmarks/gray_sort_benchmark.py
93 changes: 93 additions & 0 deletions docs/source/architecture.rst
@@ -0,0 +1,93 @@
Architecture
============

smallpond uses a DAG-based execution model with lazy evaluation:

1. Operations build a logical plan as a directed acyclic graph (DAG)
2. Execution is triggered only when an action is called (write, compute, etc.)
3. Ray distributes tasks across workers, with each worker running its own DuckDB instance
4. 3FS is the primary storage backend; the local filesystem can also be used for development and testing
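
The lazy-evaluation model above can be sketched as a toy DAG in plain Python (an illustration of the execution model only; these are not smallpond's actual classes):

```python
class Node:
    """One operation in the logical plan; nothing runs at construction time."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def compute(self):
        """Action: walk the DAG and execute each node's function."""
        inputs = [p.compute() for p in self.parents]
        return self.fn(*inputs)

# Building the plan is cheap: only Node objects are created.
source = Node(lambda: [3, 1, 2])
mapped = Node(lambda xs: [x * 10 for x in xs], source)
result = Node(lambda xs: sorted(xs), mapped)

# Execution happens only now, when the action is called.
print(result.compute())  # prints [10, 20, 30]
```

In smallpond the same trigger role is played by actions such as ``write_parquet`` and ``compute``, with Ray scheduling the node executions across workers instead of a recursive in-process walk.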

.. mermaid::
:align: center
:caption: smallpond Architecture

flowchart TD
%% Main Flow subgraph at the top
subgraph MF[Main Flow]
direction TB
User([User]):::userFlow
Code[User Code]:::userFlow
DAG[Logical Plan DAG]:::userFlow
Partitions[(Partitioned Data)]

User --> |"1 Creates DataFrame<br>operations"| Code
Code --> |"2 Builds"| DAG
DAG --> |"3 Optimizes & manually<br>partitions data"| Partitions
User --> |"6 Triggers execution<br>(write_parquet, compute)"| DAG
end

%% Distributed Execution subgraph
subgraph DE[Distributed Execution]
direction TB
RayCluster[Ray Cluster]:::execution

%% Workers level
Worker1[Worker 1]:::execution
Worker2[Worker 2]:::execution
Worker3[Worker 3]:::execution

%% DuckDB level
DuckDB1[DuckDB Instance]:::execution
DuckDB2[DuckDB Instance]:::execution
DuckDB3[DuckDB Instance]:::execution

%% Internal connections
RayCluster --> Worker1
RayCluster --> Worker2
RayCluster --> Worker3

Worker1 --> |"5a Processes<br>partition"| DuckDB1
Worker2 --> |"5b Processes<br>partition"| DuckDB2
Worker3 --> |"5c Processes<br>partition"| DuckDB3
end

%% Bottom row with Storage and Results side by side
subgraph Bottom[ ]
direction LR
subgraph SO[Storage Options]
direction LR
Storage[(Storage Layer)]:::storage
ThreeFS[3FS]:::storage
LocalFS[Local FS]:::storage

Storage --> ThreeFS
Storage --> LocalFS
end

subgraph RC[Results Collection]
direction LR
Results[Results Collection]:::userFlow
end
end

%% Connect subgraphs
Partitions --> |"4 Distributes tasks<br>via Ray"| RayCluster
Partitions -.-> Storage

%% Storage connections with dotted lines
DuckDB1 -.-> |"Reads/Writes"| Storage
DuckDB2 -.-> |"Reads/Writes"| Storage
DuckDB3 -.-> |"Reads/Writes"| Storage

%% Results collection
DuckDB1 --> |"7a Produces"| Results
DuckDB2 --> |"7b Produces"| Results
DuckDB3 --> |"7c Produces"| Results
Results --> |"8 Returns"| User

%% Styling
classDef userFlow fill:#ff69b4,stroke:#333333,stroke-width:2px,color:#ffffff
classDef execution fill:#4CAF50,stroke:#333333,stroke-width:1px,color:#ffffff
classDef storage fill:#2196F3,stroke:#333333,stroke-width:1px,color:#ffffff
classDef default color:#ffffff
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -16,6 +16,7 @@
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinxcontrib.mermaid",
]

templates_path = ["_templates"]
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -19,6 +19,7 @@ Why smallpond?
:maxdepth: 1

getstarted
architecture
internals

.. toctree::
1 change: 1 addition & 0 deletions pyproject.toml
@@ -59,6 +59,7 @@ dev = [
docs = [
"sphinx==7.1.2",
"pydata-sphinx-theme==0.14.4",
"sphinxcontrib-mermaid==1.0.0",
]
warc = [
"warcio >= 1.7.4",