sirius-db/tpchgen-rs

tpchgen-rs


Note: this is a fork of https://github.com/clflushopt/tpchgen-rs. See scripts/README.md for a list of differences.


Blazing fast TPCH benchmark data generator, in pure Rust with zero dependencies.

Features

  1. Blazing Speed 🚀
  2. Obsessively Tested 📋
  3. Fully parallel, streaming, constant memory usage 🧠

Try it now

The easiest way to use this software is via the tpchgen-cli tool.

Generating Data

1. Set up the environment

Install pixi, then run pixi install to set up Rust 1.89.0, Python, and pyarrow in one step:

curl -fsSL https://pixi.sh/install.sh | bash
pixi install
pixi shell

2. Build tpchgen-cli

RUSTFLAGS='-C target-cpu=native' cargo build --release -p tpchgen-cli

Add the binary to your PATH:

export PATH="$PWD/target/release:$PATH"

3. Generate data

Use scripts/generate_tpch.py to generate a full dataset with GPU-optimized partition sizes, encodings, and compression settings:

python scripts/generate_tpch.py -s <SCALE_FACTOR> -f parquet -j <PARALLELISM> -o <OUTPUT_DIR>

For example, to generate scale factor 100 using 16 parallel jobs:

python scripts/generate_tpch.py -s 100 -f parquet -j 16 -o tpch-sf100

Key options:

| Option | Description | Default |
| --- | --- | --- |
| -s | Scale factor (integer) | 1000 |
| -f | Output format: parquet or tbl | parquet |
| -o | Output directory | tpch-data |
| -j | Number of parallel jobs | number of CPU threads |
| --parquet-row-group-bytes N | Override row group size in bytes for all tables | per-table defaults |
| --use-upstream-compression | Compress all columns (default skips incompressible ones) | off |
| --use-float-type | Use f64 for decimal columns instead of decimal128 | off |
| --use-timestamp-type | Use timestamp_ms for date columns instead of date32 | off |
| --use-large-ids | Use i64 for nationkey/regionkey instead of i32 | off |
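When the generator is driven from another Python program, the invocation can be assembled as an argument list rather than a shell string. The sketch below is illustrative only: the helper name `build_generate_command` is hypothetical, it uses only the flags listed in the options table, and it builds the command without running it.

```python
import shlex


def build_generate_command(scale_factor, output_dir, fmt="parquet",
                           jobs=None, extra_flags=()):
    """Return the argv list for scripts/generate_tpch.py (does not run it)."""
    if not isinstance(scale_factor, int):
        raise ValueError("scale factor must be an integer")
    if fmt not in ("parquet", "tbl"):
        raise ValueError(f"unsupported format: {fmt!r}")
    argv = [
        "python", "scripts/generate_tpch.py",
        "-s", str(scale_factor),
        "-f", fmt,
        "-o", str(output_dir),
    ]
    if jobs is not None:
        # Omitting -j leaves the script's own default in effect.
        argv += ["-j", str(jobs)]
    argv.extend(extra_flags)  # e.g. ["--use-float-type"]
    return argv


if __name__ == "__main__":
    cmd = build_generate_command(100, "tpch-sf100", jobs=16)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the output directory contains spaces.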

The script writes each table into its own subdirectory as zero-indexed partition files:

tpch-sf100/
├── customer/
│   └── part.0.parquet
├── lineitem/
│   ├── part.0.parquet
│   ├── part.1.parquet
│   └── part.2.parquet   (6 files at SF=100)
├── orders/
│   ├── part.0.parquet
│   └── part.1.parquet   (2 files at SF=100)
└── ...
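A consumer of this layout usually wants each table's partitions in index order. The following is a minimal sketch, assuming the `part.<index>.parquet` naming shown above; the function name `list_partitions` is hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Matches the naming scheme shown in the tree: part.<index>.parquet
_PART_RE = re.compile(r"^part\.(\d+)\.parquet$")


def list_partitions(output_dir):
    """Map each table name to its partition files, ordered by partition index."""
    tables = {}
    for table_dir in sorted(Path(output_dir).iterdir()):
        if not table_dir.is_dir():
            continue  # skip top-level files such as metadata.json
        parts = []
        for path in table_dir.iterdir():
            match = _PART_RE.match(path.name)
            if match:
                parts.append((int(match.group(1)), path))
        # Sort numerically so part.10 comes after part.2, not before it.
        tables[table_dir.name] = [path for _, path in sorted(parts)]
    return tables
```

Sorting on the parsed integer rather than the file name matters once a table has ten or more partitions.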

A metadata.json file is written to the output directory summarising the schema, row counts, file sizes, and encodings of every generated table. See scripts/README.md for full details on the partition sizing formula and all available options.
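The exact layout of metadata.json is documented in scripts/README.md and is not reproduced here, so the sketch below deliberately assumes nothing about its keys. It only loads the file and reports the type of each top-level entry; both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_metadata(output_dir):
    """Parse <output_dir>/metadata.json and return the decoded JSON value."""
    with open(Path(output_dir) / "metadata.json", encoding="utf-8") as handle:
        return json.load(handle)


def describe_top_level(metadata):
    """Return {key: type name} for a JSON object, without assuming any keys."""
    if not isinstance(metadata, dict):
        return {"<root>": type(metadata).__name__}
    return {key: type(value).__name__ for key, value in metadata.items()}
```

Calling `describe_top_level(load_metadata("tpch-sf100"))` is a quick way to see what the file contains before writing code that depends on specific fields.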

Performance

tpchgen-cli is more than 10x faster than the next fastest TPC-H generator we know of. On a 2023 Mac M3 Max laptop, it easily generates data faster than it can be written to SSD. See BENCHMARKS.md for more details on performance and benchmarking.

Times to create TPCH tables in Parquet format using tpchgen-cli and duckdb for various scale factors.

| Scale Factor | tpchgen-cli | DuckDB | DuckDB (proprietary) |
| --- | --- | --- | --- |
| 1 | 0:02.24 | 0:12.29 | 0:10.68 |
| 10 | 0:09.97 | 1:46.80 | 1:41.14 |
| 100 | 1:14.22 | 17:48.27 | 16:40.88 |
| 1000 | 10:26.26 | N/A (OOM) | N/A (OOM) |

  • DuckDB (proprietary) is the time required to create TPCH data using the proprietary DuckDB format
  • Creating Scale Factor 1000 using DuckDB required 647 GB of memory, which is why no DuckDB time is reported for it in the table above.
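
The speedup at each scale factor can be worked out directly from the table. This sketch assumes the times are formatted as minutes:seconds and compares tpchgen-cli against DuckDB writing Parquet; the names `parse_elapsed`, `TIMINGS`, and `speedups` are illustrative.

```python
def parse_elapsed(text):
    """Convert 'minutes:seconds' (e.g. '17:48.27') to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


# Times copied from the table above: (tpchgen-cli, DuckDB writing Parquet).
# Scale factor 1000 is omitted because no DuckDB time is reported for it.
TIMINGS = {
    1: ("0:02.24", "0:12.29"),
    10: ("0:09.97", "1:46.80"),
    100: ("1:14.22", "17:48.27"),
}


def speedups(timings=TIMINGS):
    """Return {scale factor: DuckDB time divided by tpchgen-cli time}."""
    return {
        sf: parse_elapsed(duckdb) / parse_elapsed(tpchgen)
        for sf, (tpchgen, duckdb) in timings.items()
    }


if __name__ == "__main__":
    for sf, ratio in speedups().items():
        print(f"SF {sf}: {ratio:.1f}x")
```

The ratio grows with the scale factor in these measurements, so the gap is widest on the largest dataset that DuckDB completed.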

[Figure: Parquet Generation Performance]

Answers

The core tpchgen crate provides answers for queries 1 to 22 at scale factor 1. These answers were derived from the official TPC-H Tools distribution.

Testing

This crate has extensive tests to ensure correctness and produces byte-for-byte the same output as the original dbgen implementation. We compare the output of this crate with dbgen as part of every check-in. See TESTING.md for more details on the testing methodology.
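
A byte-for-byte check of this kind can be reproduced locally with a few lines of Python. This is a generic sketch, not the project's test harness (see TESTING.md for that); the function name `same_bytes` and any paths passed to it are hypothetical.

```python
def same_bytes(path_a, path_b, chunk_size=1 << 20):
    """Return True if two files have identical contents, byte for byte."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # differing bytes or differing lengths
            if not chunk_a:
                return True  # both files reached end-of-file together
```

Reading in fixed-size chunks keeps memory usage flat even for multi-gigabyte files, for example when comparing a table written by dbgen with the same table written by this generator.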

Crates

  • tpchgen: the core data generator logic for TPC-H. It has no dependencies and is easy to embed in other Rust projects.

  • tpchgen-arrow: generates TPC-H data in Apache Arrow format. It depends on the arrow-rs library.

  • tpchgen-cli: a dbgen-compatible CLI tool that generates benchmark datasets using multiple processes.

Contributing

Pull requests are welcome. For major changes, please open an issue first for discussion. See our contributors guide for more details.

Architecture

Please see the architecture guide for details on how the code is structured.

License

The project is licensed under the Apache 2.0 license.

References

  • The TPC-H Specification, see the specification page.
  • The original dbgen implementation: you must submit an official request at the TPC's official website to access the dbgen software.
