The Faster Swin-Transformer contains the Swin-Transformer model, a state-of-the-art vision transformer model which was presented in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. The abstract of the paper is the following:
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO testdev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-theart by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
Our implementation is aligned with the official PyTorch implementation Swin-Transformer Github.
In this demo, you can run Faster Swin-Transformer as a C++ program.
- CMake >= 3.13 for PyTorch
- CUDA 11.0 or newer version
- NCCL 2.10 or newer version
- Python 3 is recommended because some features are not supported in python 2
- PyTorch: Verify on 1.10.0, >= 1.5.0 should work.
Recommand to use image nvcr.io/nvidia/pytorch:21.07-py3
.
docker run -ti --gpus all --rm nvcr.io/nvidia/pytorch:21.07-py3 bash
- Start the docker container, ensure mounting the project directory into it. For example:
docker run \ -it \ --rm \ --gpus=all \ -v {YOUR_FASTER_TRANSFORMER_PROJECT_DIR_ON_HOST}:/workspace/FasterTransformer \ --workdir /workspace/FasterTransformer \ nvcr.io/nvidia/pytorch:21.07-py3 bash export WORKSPACE = /workspace/FasterTransformer
Here, we use nvcr.io/nvidia/pytorch:21.07-py3
, you can also switch it to another CUDA-enabled PyTorch containers, but need to comply with the previous requirements.
- Build the FasterTransformer with C++:
cd $WORKSPACE git submodule update --init mkdir -p build cd build cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_TRT=ON .. make
Note: xx is the compute capability of your GPU. For example, 60 (P40) or 61 (P4) or 70 (V100) or 75(T4) or 80 (A100).
Firstly we use ./bin/swin_gemm
as the tool to search the best GEMM configuration. And then run ./bin/swin_example
or ./bin/swin_int8_example
.
Data Type = 0 (FP32) or 1 (FP16) or 2 (BF16)
# is_fp16=0 indicates FP32, is_fp16=1 indicates FP16
# model_type={0,1,2,3,4,5}
# 0: swin-TINY with window size 7x7
# 1: swin-SMALL with window size 7x7
# 2: swin-BASE with window size 7x7
# 3: swin-BASE with window size 12x12
# 4: swin-LARGE with window size 7x7
# 5: swin-LARGE with window size 12x12
./bin/swin_gemm <batch_size> <image_width> <window_width> <head_number of the first block> <size_per_head> <data_type> <is_use_int8>
./bin/swin_example <is_fp16> <model_type[0-5]> <batch_size>
./bin/swin_int8_example <model_type[0-5]> <batch_size>
Take swin-TINY with batch=32 as an example:
# Run Swin-Transformer(TINY) under FP32 on C++:
./bin/swin_gemm 32 224 7 3 32 0 0
./bin/swin_example 0 0 32
# Run Swin-Transformer(TINY) under FP16 on C++
./bin/swin_gemm 32 224 7 3 32 1 0
./bin/swin_example 1 0 32
# Run Swin-Transformer(TINY) under INT8 on C++
./bin/swin_gemm 32 224 7 3 32 0 1
./bin/swin_int8_example 0 32
Download checkpoint
cd $WORKSPACE/examples/pytorch/swin/Swin-Transformer-Quantization
wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
Run FP16/FP32 pytorch op
cd $WORKSPACE/examples/pytorch/swin
pip install timm==0.4.12
pip install termcolor==1.1.0
bash -x run_test.sh <batch_size> ##profile of FP16/FP32 model
Run INT8 pytorch op
-
Get calibrated checkpoint
Refer to Guide of Swin-Transformer Quantization Toolkit when installing dependencies and setting up datasets
cd $WORKSPACE/examples/pytorch/swin/Swin-Transformer-Quantization
python -m torch.distributed.launch --nproc_per_node 1 \
--master_port 12345 main.py \
--calib \
--cfg SwinTransformer/configs/swin_tiny_patch4_window7_224.yaml \
--resume swin_tiny_patch4_window7_224.pth \
--data-path <imagenet-path> \
--num-calib-batch 10 \
--calib-batchsz 8\
--int8-mode 1\
--calib-output-path calib-checkpoint
NOTE: If you ONLY want to use PTQ instead of QAT: when calibrating TINY/SMALL/BASE model, --int8-mode 1
suffices. When calibrating LARGE model, we have to specify --int8-mode 2
instead of --int8-mode 1
. The reason is, Swin-L
is much harder to quantize, and we have to disable more quantization nodes in order to obtain satisfactory PTQ accuracy results.
(When int8_mode=1
, all GEMMs are INT8-in-INT8-out.
When int8_mode=2
, GEMM of all fc2
layers and patchMerge
are relaxed to INT8-in-INT32-out, while other GEMMs keep INT8-I/O. )
If you want to insists on using --int8-mode 1
for LARGE model (because speed of mode=1 is much faster), we recommend using QAT to finetune paramters of LARGE checkpoint.
- Run test
cd $WORKSPACE/examples/pytorch/swin
pip install timm==0.4.12
pip install termcolor==1.1.0
bash -x run_test_int8.sh batch_size ##profile of INT8 model
bash -x run_test_int8_accuracy.sh batch_size ##test accuracy of INT8 model
Note: When testing PTQ accuracy for INT8 Swin-LARGE, we have to specify --int8-mode 2
instead of --int8-mode 1
in run_test_int8.sh.
However, if you have finetuned a Swin-LARGE using QAT and --int8-mode 1
, then you can do inference with --int8-mode 1
too. In a word, consistency have to be ensured.
- Accuracy loss of INT8 Post-Training-Quantization(PTQ) compared with FP16
name | resolution | acc@1 | acc@5 | acc@1(I) | acc@5(I) |
---|---|---|---|---|---|
Swin-T | 224x224 | 81.182 | 95.522 | 80.748(-0.434) | 95.328(-0.194) |
Swin-S | 224x224 | 83.214 | 96.242 | 82.904(-0.288) | 96.196(-0.036) |
Swin-B | 224x224 | 83.424 | 96.446 | 83.116(-0.354) | 96.322(-0.144) |
Swin-B | 384x384 | 84.474 | 96.956 | 84.034(-0.440) | 96.782(-0.174) |
Swin-L | 224x224 | 86.246 | 97.876 | 85.892(-0.354) | 97.822(-0.054) |
Swin-L | 384x384 | 87.246 | 98.248 | 86.916(-0.330) | 98.174(-0.074) |
FP16/FP32 TensorRT plugin
cd $WORKSPACE/examples/tensorrt/swin
#FP16 engine build & infer
sh run_builder_fp16.sh
sh run_infer_fp16.sh batch_size
#FP32 engine build & infer
sh run_builder_fp32.sh
sh run_infer_fp32.sh batch_size
INT8 TensorRT plugin
cd $WORKSPACE/examples/tensorrt/swin
#INT8 engine build & infer
sh run_builder_int8.sh
sh run_infer_int8.sh batch_size
Hardware settings:
- T4 (with mclk 5000MHz, pclk 1590MHz) with Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
- A100 (with mclk 1215, pclk 1410MHz) with Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
Software settings:
- CUDA 11.4
We here compared the performance between Swin-Transformer and FT Swin-Transformer on T4 & A100. Here we used Swin-TINY as an example, and the hyper-parameters of the model are:
- head_num = {3,6,12,24}
- size_per_head = 32
- num_of_blocks = {2,2,6,2}
Here, torch.jit.trace
means using tracing to convert Torch model to TorchScript model and then profile its performace.
Batch_size | torch.jit.trace | cpp | speedup | trt plugin | speedup | torch op | speedup |
---|---|---|---|---|---|---|---|
1 | 12.10 | 4.60 | 2.63 | 4.86 | 2.49 | 4.66 | 2.60 |
8 | 36.94 | 28.70 | 1.29 | 30.42 | 1.21 | 29.45 | 1.25 |
16 | 72.17 | 55.90 | 1.29 | 60.35 | 1.20 | 57.72 | 1.25 |
32 | 142.30 | 108.60 | 1.31 | 118.00 | 1.21 | 112.57 | 1.26 |
Batch_size | torch.jit.trace | cpp | speedup | trt plugin | speedup | torch op | speedup |
---|---|---|---|---|---|---|---|
1 | 10.81 | 1.40 | 7.72 | 1.48 | 7.30 | 2.15 | 5.03 |
8 | 18.77 | 6.80 | 2.76 | 7.32 | 2.56 | 7.00 | 2.68 |
16 | 37.63 | 13.30 | 2.83 | 14.52 | 2.59 | 13.87 | 2.71 |
32 | 75.36 | 26.10 | 2.89 | 28.98 | 2.60 | 27.70 | 2.72 |
Batch_size | torch.jit.trace | cpp | speedup(vs FP16) | trt plugin | speedup(vs FP16) | torch op | speedup(vs FP16) |
---|---|---|---|---|---|---|---|
1 | 1.24 | 1.13 | 1.27 | 1.17 | 1.24 | 1.73 | |
8 | 4.83 | 1.41 | 5.46 | 1.34 | 5.16 | 1.36 | |
16 | 9.73 | 1.37 | 11.03 | 1.32 | 10.32 | 1.34 | |
32 | 19.19 | 1.36 | 21.78 | 1.33 | 20.46 | 1.35 |
INT8 vs. FP16 speedup on Swin TINY/SMALL/BASE/LARGE:
Batch_size | TINY (FP16) | TINY (INT8) | Speedup | SMALL (FP16) | SMALL (INT8) | Speedup | BASE (FP16) | BASE (INT8) | Speedup | LARGE (FP16) | LARGE (INT8) | Speedup |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1.40 | 1.24 | 1.13 | 2.54 | 2.24 | 1.13 | 3.40 | 2.72 | 1.25 | 6.19 | 3.99 | 1.55 |
8 | 6.80 | 4.83 | 1.41 | 11.31 | 7.80 | 1.45 | 17.28 | 11.20 | 1.54 | 31.14 | 19.58 | 1.59 |
16 | 13.30 | 9.73 | 1.37 | 22.40 | 15.60 | 1.44 | 32.51 | 21.88 | 1.49 | 61.32 | 39.58 | 1.55 |
32 | 26.10 | 19.19 | 1.36 | 44.04 | 30.66 | 1.44 | 64.53 | 42.84 | 1.51 | 121.06 | 76.34 | 1.59 |
Here, torch.jit.trace
means using tracing to convert Torch model to TorchScript model and then profile its performace.
On chips with Ampere architectures (like A30, A100), user can use export NVIDIA_TF32_OVERRIDE=1
to enforce the program run under TF32, otherwise FP32 GEMM is used by default, which is much slower.
Batch_size | torch.jit.trace | cpp | speedup | trt plugin | speedup | torch op | speedup |
---|---|---|---|---|---|---|---|
1 | 7.04 | 1.58 | 4.46 | 1.61 | 4.37 | 1.92 | 3.67 |
8 | 6.99 | 4.89 | 1.43 | 5.30 | 1.32 | 4.89 | 1.43 |
16 | 11.82 | 8.78 | 1.35 | 9.56 | 1.24 | 8.89 | 1.33 |
32 | 22.74 | 16.91 | 1.34 | 18.44 | 1.23 | 16.88 | 1.35 |
Batch_size | torch.jit.trace | cpp | speedup | trt plugin | speedup | torch op | speedup |
---|---|---|---|---|---|---|---|
1 | 7.02 | 0.90 | 7.80 | 0.93 | 7.55 | 1.46 | 4.81 |
8 | 7.17 | 1.73 | 4.14 | 1.83 | 3.92 | 1.77 | 4.05 |
16 | 7.92 | 2.91 | 2.72 | 3.08 | 2.57 | 2.94 | 2.69 |
32 | 14.76 | 5.38 | 2.74 | 5.77 | 2.56 | 5.43 | 2.72 |
Batch_size | torch.jit.trace | cpp | speedup(vs FP16) | trt plugin | speedup(vs FP16) | torch op | speedup(vs FP16) |
---|---|---|---|---|---|---|---|
1 | 1.07 | 0.84 | 1.13 | 0.82 | 1.08 | 1.35 | |
8 | 1.62 | 1.07 | 1.73 | 1.06 | 1.63 | 1.09 | |
16 | 2.34 | 1.24 | 2.50 | 1.23 | 2.34 | 1.26 | |
32 | 3.96 | 1.36 | 4.27 | 1.35 | 4.07 | 1.33 |
INT8 vs. FP16 speedup on Swin TINY/SMALL/BASE/LARGE:
Batch_size | TINY (FP16) | TINY (INT8) | Speedup | SMALL (FP16) | SMALL (INT8) | Speedup | BASE (FP16) | BASE (INT8) | Speedup | LARGE (FP16) | LARGE (INT8) | Speedup |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.90 | 1.07 | 0.84 | 1.64 | 2.02 | 0.81 | 1.81 | 2.20 | 0.82 | 2.22 | 2.66 | 0.83 |
8 | 1.73 | 1.62 | 1.07 | 3.02 | 2.85 | 1.06 | 3.60 | 3.33 | 1.08 | 6.04 | 4.61 | 1.31 |
16 | 2.91 | 2.34 | 1.24 | 4.79 | 3.94 | 1.22 | 6.11 | 4.81 | 1.27 | 10.92 | 7.55 | 1.45 |
32 | 5.38 | 3.96 | 1.36 | 8.60 | 6.42 | 1.34 | 11.42 | 8.29 | 1.38 | 20.49 | 13.82 | 1.48 |
64 | 10.31 | 7.42 | 1.39 | 16.48 | 11.74 | 1.40 | 22.26 | 15.85 | 1.40 | 39.62 | 27.25 | 1.45 |