This repository contains the source code for the paper "TPCx-AI under the Microscope: A Benchmarking Debt Analysis", submitted to VLDB 2026.
In the folder notebooks/metadata_notebooks, you can find a set of notebooks to interactively analyze the TPCx-AI datasets and workloads at SF1.
To access the code for running an end-to-end evaluation on a remote server at SF1-30, follow the instructions in the data_center folder.
To set up the environment for interactive SF1 exploration of the TPCx-AI data, execute the steps below.
git clone https://github.com/ilin-t/tpcx-ai-2-analysis
Download the raw data archive (raw_data.zip) from: https://drive.google.com/file/d/1IZPBFwakTzEQwO9cWeD-HVcQAm1O6L73/view?usp=sharing
mkdir data && cd data
unzip raw_data.zip
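If you prefer to script this step, the snippet below is a minimal sketch that downloads and unpacks the archive from Python. It assumes the third-party gdown package is installed (pip install gdown) and that the share link above stays publicly readable.

```python
# Minimal sketch: download raw_data.zip from the share link above and unpack it
# into the data/ directory. Assumes `pip install gdown` has been run.
import zipfile

import gdown

URL = "https://drive.google.com/file/d/1IZPBFwakTzEQwO9cWeD-HVcQAm1O6L73/view?usp=sharing"

# fuzzy=True lets gdown extract the file id directly from the share link.
gdown.download(url=URL, output="raw_data.zip", fuzzy=True)

# Unpack the archive into data/ (created if it does not exist yet).
with zipfile.ZipFile("raw_data.zip") as archive:
    archive.extractall("data")
```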
Install all TPCx-AI dependencies:
cd setup
bash setup-python.sh
The setup-python.sh script installs two environments:
- python-venv for Use Cases 1, 4, 8, 10
- python-venv-ks for Use Cases 2, 3, 5, 6, 7, 9
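As a small, purely illustrative helper (the environment names and use-case numbers are taken from the list above), the mapping can be expressed as:

```python
# Illustrative only: which virtual environment runs which TPCx-AI use case,
# mirroring the environments installed by setup-python.sh.
VENV_FOR_USE_CASE = {
    **{uc: "python-venv" for uc in (1, 4, 8, 10)},
    **{uc: "python-venv-ks" for uc in (2, 3, 5, 6, 7, 9)},
}

def venv_for(use_case: int) -> str:
    """Return the environment expected to run the given use case."""
    return VENV_FOR_USE_CASE[use_case]

print(venv_for(5))  # -> python-venv-ks
```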
cd ../notebooks/metadata_notebooks/
This folder contains all 10 use cases with a step-by-step breakdown and metadata generation.
The pipelines generate long metadata files and visualizations of their data distributions and/or decision boundaries. To skip the metadata generation and analyze the JSON files independently, skip to Metadata Analysis.
To compare some of the pipelines and run the analysis at larger scale factors on already prepared data, go to the notebooks/analysis folder and run the notebooks inside python-venv.
The raw metadata can be found in the notebooks/json_outputs directory.
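The snippet below is a minimal sketch for getting an overview of those files; it only assumes that they are regular JSON documents and does not rely on any specific file names.

```python
# Sketch: list the pre-generated metadata files and print their top-level structure.
import json
from pathlib import Path

json_dir = Path("notebooks/json_outputs")
for path in sorted(json_dir.glob("*.json")):
    with path.open() as f:
        metadata = json.load(f)
    # Show the file name and the first few top-level keys (or the value type otherwise).
    summary = list(metadata)[:10] if isinstance(metadata, dict) else type(metadata).__name__
    print(path.name, summary)
```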
The default TPCx-AI pipelines can be found in the pipelines directory in .py format and in the notebooks/default_notebooks directory in .ipynb format.
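To get a quick overview of what is available, a short sketch like the one below lists both variants; only the directory paths mentioned above are assumed.

```python
# Sketch: list the default TPCx-AI pipelines in both formats.
from pathlib import Path

print("Python pipelines (pipelines/):")
for script in sorted(Path("pipelines").glob("*.py")):
    print(" -", script.name)

print("Notebook versions (notebooks/default_notebooks/):")
for notebook in sorted(Path("notebooks/default_notebooks").glob("*.ipynb")):
    print(" -", notebook.name)
```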