Note: this is a fork of https://github.com/clflushopt/tpchgen-rs. See scripts/README.md for a list of differences.
Blazing fast TPCH benchmark data generator, in pure Rust with zero dependencies.
- Blazing Speed 🚀
- Obsessively Tested 📋
- Fully parallel, streaming, constant memory usage 🧠
The easiest way to use this software is via the tpchgen-cli tool.
Install pixi, then run `pixi install` to set up Rust 1.89.0, Python, and pyarrow in one step:

```shell
curl -fsSL https://pixi.sh/install.sh | bash
pixi install
```

Enter the environment and build the CLI:

```shell
pixi shell
RUSTFLAGS='-C target-cpu=native' cargo build --release -p tpchgen-cli
```

Add the binary to your PATH:

```shell
export PATH="$PWD/target/release:$PATH"
```

Use scripts/generate_tpch.py to generate a full dataset with GPU-optimized partition sizes, encodings, and compression settings:

```shell
python scripts/generate_tpch.py -s <SCALE_FACTOR> -f parquet -j <PARALLELISM> -o <OUTPUT_DIR>
```

For example, to generate scale factor 100 using 16 parallel jobs:

```shell
python scripts/generate_tpch.py -s 100 -f parquet -j 16 -o tpch-sf100
```

Key options:
| Option | Description | Default |
|---|---|---|
| `-s` | Scale factor (integer) | 1000 |
| `-f` | Output format: `parquet` or `tbl` | `parquet` |
| `-o` | Output directory | `tpch-data` |
| `-j` | Number of parallel jobs | number of CPU threads |
| `--parquet-row-group-bytes N` | Override row group size in bytes for all tables | per-table defaults |
| `--use-upstream-compression` | Compress all columns (default skips incompressible ones) | off |
| `--use-float-type` | Use `f64` for decimal columns instead of `decimal128` | off |
| `--use-timestamp-type` | Use `timestamp_ms` for date columns instead of `date32` | off |
| `--use-large-ids` | Use `i64` for nationkey/regionkey instead of `i32` | off |
The script writes each table into its own subdirectory as zero-indexed partition files:

```
tpch-sf100/
├── customer/
│   └── part.0.parquet
├── lineitem/
│   ├── part.0.parquet
│   ├── part.1.parquet
│   └── part.2.parquet (6 files at SF=100)
├── orders/
│   ├── part.0.parquet
│   └── part.1.parquet (2 files at SF=100)
└── ...
```
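Given this layout, a quick sanity check on a generated dataset is to count the partition files in each table subdirectory. A minimal sketch using only the Python standard library (the directory path and any output directory name are up to you):

```python
from pathlib import Path

def partition_counts(output_dir):
    """Map each table subdirectory to its number of part.N.parquet files."""
    counts = {}
    for table_dir in sorted(Path(output_dir).iterdir()):
        if table_dir.is_dir():
            counts[table_dir.name] = len(list(table_dir.glob("part.*.parquet")))
    return counts
```

For example, `partition_counts("tpch-sf100")` would report 6 files for `lineitem` and 2 for `orders` in the layout above.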
A metadata.json file is written to the output directory summarising the schema, row counts,
file sizes, and encodings of every generated table. See scripts/README.md
for full details on the partition sizing formula and all available options.
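The authoritative description of metadata.json lives in scripts/README.md; as an illustration only, a summary over a file of this general shape could be computed with the standard library. The field names below (`row_count`, `files`) are assumptions for the sketch, not the script's actual schema; the row counts shown are the standard TPC-H values at scale factor 100:

```python
import json

# Hypothetical metadata.json contents -- the real field names are defined
# by scripts/generate_tpch.py and may differ from this sketch.
metadata = json.loads("""
{
  "lineitem": {"row_count": 600037902, "files": 6},
  "orders":   {"row_count": 150000000, "files": 2}
}
""")

def total_rows(metadata):
    """Sum row counts across all tables in the metadata mapping."""
    return sum(table["row_count"] for table in metadata.values())

print(total_rows(metadata))  # 750037902
```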
tpchgen-cli is more than 10x faster than the next fastest TPCH generator we
know of. On a 2023 Mac M3 Max laptop, it easily generates data faster than can
be written to SSD. See BENCHMARKS.md for more
details on performance and benchmarking.
Times to create TPC-H tables in Parquet format using tpchgen-cli and DuckDB at various scale factors:
| Scale Factor | tpchgen-cli | DuckDB | DuckDB (proprietary) |
|---|---|---|---|
| 1 | 0:02.24 | 0:12.29 | 0:10.68 |
| 10 | 0:09.97 | 1:46.80 | 1:41.14 |
| 100 | 1:14.22 | 17:48.27 | 16:40.88 |
| 1000 | 10:26.26 | N/A (OOM) | N/A (OOM) |
- DuckDB (proprietary) is the time required to create TPCH data using the proprietary DuckDB format
- Creating Scale Factor 1000 using DuckDB required 647 GB of memory, which is why it is not included in the table above.
The core tpchgen crate provides answers for queries 1 to 22 at scale factor 1. These answers were derived from the official TPC-H Tools distribution.
This crate has extensive tests to ensure correctness and produces exactly the
same, byte-for-byte output as the original dbgen implementation. We compare
the output of this crate with dbgen as part of every check-in. See
TESTING.md for more details on the testing methodology.
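A byte-for-byte comparison of two generated files, of the kind described above, can be sketched with the Python standard library. This is a minimal illustration, not the project's actual test harness:

```python
import hashlib

def same_bytes(path_a, path_b):
    """Return True if the two files have identical contents."""
    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Hash in 1 MiB chunks so large .tbl files never load fully into memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.digest()
    return digest(path_a) == digest(path_b)
```

Hashing rather than comparing raw bytes keeps memory constant even for multi-gigabyte outputs, at the cost of reading both files in full.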
- `tpchgen`: the core data generator logic for TPC-H. It has no dependencies and is easy to embed in other Rust projects.
- `tpchgen-arrow`: generates TPC-H data in Apache Arrow format. It depends on the arrow-rs library.
- `tpchgen-cli`: a `dbgen`-compatible CLI tool that generates benchmark datasets using multiple processes.
Pull requests are welcome. For major changes, please open an issue first for discussion. See our contributors guide for more details.
Please see architecture guide for details on how the code is structured.
The project is licensed under the Apache 2.0 license.
