```bash
git clone https://github.com/HPMLL/SpInfer_EuroSys25.git
cd SpInfer_EuroSys25
git submodule update --init --recursive
source Init_SpInfer.sh
cd $SpInfer_HOME/third_party/FasterTransformer && git apply ../ft_spinfer.patch
cd $SpInfer_HOME/third_party/sputnik && git apply ../sputnik.patch
```
- Requirements (a quick sanity-check snippet follows this list):
  - Ubuntu 16.04+
  - gcc >= 7.3 and cmake >= 3.30.3
  - CUDA >= 12.2 and nvcc >= 12.0
  - NVIDIA GPU with sm >= 80 (i.e., Ampere A6000 or Ada RTX 4090)
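An optional way to verify these requirements before building; only standard tool flags are used here (the `compute_cap` query needs a reasonably recent NVIDIA driver):
```bash
# Optional sanity check of the toolchain against the requirements above.
gcc --version | head -n1                              # expect gcc >= 7.3
cmake --version | head -n1                            # expect cmake >= 3.30.3
nvcc --version | grep release                         # expect release 12.x
nvidia-smi --query-gpu=name,compute_cap --format=csv  # expect compute_cap >= 8.0
```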
- 2.1 Install conda on your system (see the official conda installation tutorial).
- 2.2 Create a conda environment:
```bash
cd $SpInfer_HOME
conda env create -f spinfer.yml
conda activate spinfer
```
- Build SpInfer. After this step, libSpMM_API.so and SpMM_API.cuh will be available for easy integration into your own projects (a linking sketch follows):
```bash
cd $SpInfer_HOME/build && make -j
```
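A minimal sketch of such an integration, assuming only the build layout shown above: `my_app.cu` is a placeholder for your own code, and the include path is an assumption, so check the repository for the actual location of SpMM_API.cuh and its entry points.
```bash
# Hypothetical integration sketch: compile your own CUDA file against the
# SpInfer SpMM library. my_app.cu and the -I path are placeholders.
nvcc -O3 -arch=sm_80 my_app.cu \
    -I$SpInfer_HOME/build \
    -L$SpInfer_HOME/build -lSpMM_API \
    -o my_app
# Make the shared library discoverable at run time:
export LD_LIBRARY_PATH=$SpInfer_HOME/build:$LD_LIBRARY_PATH
```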
- Build Sputnik.
```bash
cd $SpInfer_HOME/third_party/
source build_sputnik.sh
```
- Build SparTA.
```bash
cd $SpInfer_HOME/third_party/
source preparse_cusparselt.sh
```
- Reproduce Figure 10.
```bash
cd $SpInfer_HOME/kernel_benchmark
source test_env
make -j
source benchmark.sh
```
Check the results in the raw csv files and in the reproduced Figure10.png (Fig. 10).
Follow the steps in SpInfer/docs/LLMInferenceExample for:
- Building FasterTransformer with SpInfer, Flash-LLM, or standard integration.
- Downloading and converting the OPT models.
- Configuration. Note: Model_dir differs for SpInfer, Flash-LLM, and FasterTransformer (a hypothetical illustration follows this list).
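To make that note concrete, here is a purely hypothetical illustration; the actual directories come from your own model-conversion step as described in docs/LLMInferenceExample:
```bash
# Purely hypothetical paths: each build reads its own converted weights, so
# Model_dir must point at the matching conversion output for each baseline.
SPINFER_MODEL_DIR=$SpInfer_HOME/models/opt-30b-spinfer     # SpInfer build
FLASHLLM_MODEL_DIR=$FlashLLM_HOME/models/opt-30b-flashllm  # Flash-LLM build
FT_MODEL_DIR=$FT_HOME/models/opt-30b-ft                    # standard FT build
```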
```bash
cd $SpInfer_HOME/third_party/
bash run_1gpu_loop.sh
```
- Check the results (Fig. 13/14) in $SpInfer_HOME/third_party/FasterTransformer/OutputFile_1gpu_our_60_inlen64/.
- Test tensor_para_size=2 using `bash run_2gpu_loop.sh`.
- Test tensor_para_size=4 using `bash run_4gpu_loop.sh`.
```bash
cd $FlashLLM_HOME/third_party/
bash run_1gpu_loop.sh
```
- Check the results in $FlashLLM_HOME/third_party/FasterTransformer/OutputFile_1gpu_our_60_inlen64/.
- Test tensor_para_size=1 using `bash run_1gpu_loop.sh`.
```bash
cd $FT_HOME/third_party/
bash run_2gpu_loop.sh
```
- Check the results in $FT_HOME/FasterTransformer/OutputFile_2gpu_our_60_inlen64/.
```bash
cd $SpInfer_HOME/end2end_inference/ds_scripts
pip install -r requirements.txt
bash run_ds_loop.sh
```
- Check the results in $SpInfer_HOME/end2end_inference/ds_scripts/ds_result/.
If you find this work useful, please cite this project and our paper.
```bibtex
@inproceedings{fan2025spinfer,
  title={SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs},
  author={Fan, Ruibo and Yu, Xiangrui and Dong, Peijie and Li, Zeyu and Gong, Gu and Wang, Qiang and Wang, Wei and Chu, Xiaowen},
  booktitle={Proceedings of the Twentieth European Conference on Computer Systems},
  pages={243--260},
  year={2025}
}
```