apuaaChen/sparse_transformer_sc21
Sparse Transformer Inference

This repo provides a PyTorch extension that speeds up transformer inference with fixed structured sparsity.

End-to-end speedup and memory profiling can be obtained with end_to_end.py.

  • To profile the execution time of the sparse transformer, launch python3 end_to_end.py --model sparse under Nsight Systems.
  • To profile the execution time of the dense transformer, launch python3 end_to_end.py --model dense under Nsight Systems.
  • To profile the memory of the sparse transformer, launch python3 end_to_end.py --model sparse --mem under Nsight Systems.
  • To profile the memory of the dense transformer, launch python3 end_to_end.py --model dense --mem under Nsight Systems.
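Concretely, a profiling run under Nsight Systems might look like the following. The nsys flags and output file names shown here are illustrative choices, not prescribed by this repo:

```shell
# Profile the sparse model's execution time under Nsight Systems.
# --trace=cuda,nvtx also captures the NVTX annotations in the program;
# the -o output name is an arbitrary choice.
nsys profile --trace=cuda,nvtx -o sparse_time python3 end_to_end.py --model sparse

# Memory profiling of the dense model: pytorch_memlab produces the
# per-line memory report, while nsys records the timeline.
nsys profile --trace=cuda,nvtx -o dense_mem python3 end_to_end.py --model dense --mem
```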

Dependencies

We generate the sparse mask with scipy.sparse. The PyTorch version is 1.8.1+cu111. Memory profiling is based on pytorch_memlab, and we annotate our program with NVTX ranges.
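As a rough illustration of how a fixed structured sparsity mask can be built with scipy.sparse, here is a minimal sketch. The block size, density, and function name are assumptions for illustration, not the repo's actual parameters:

```python
# Illustrative sketch (not this repo's code): build a block-structured
# 0/1 mask using scipy.sparse. Block size and density are assumed values.
import numpy as np
import scipy.sparse as sp

def random_block_mask(rows, cols, block=8, density=0.25, seed=0):
    """Return a dense 0/1 mask whose nonzero pattern is block-structured."""
    rng = np.random.default_rng(seed)
    br, bc = rows // block, cols // block
    # Draw a random block-level sparsity pattern in CSR form.
    pattern = sp.random(br, bc, density=density, format="csr", random_state=rng)
    # Expand each nonzero block into a block x block patch of ones.
    mask = np.kron((pattern.toarray() != 0).astype(np.float32),
                   np.ones((block, block), dtype=np.float32))
    return mask

mask = random_block_mask(64, 64)
print(mask.shape)  # (64, 64)
```

A mask like this would typically be multiplied elementwise into an attention score matrix (or used to select which tiles a sparse kernel computes).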

To build the custom kernels, run src/install.sh. As our kernels target the V100 GPU's Tensor Core architecture, only sm_70 is currently supported.
