Skip to content

ConVision Benchmark: A Contemporary Framework to Benchmark CNN and ViT Models

Notifications You must be signed in to change notification settings

krishnateja95/COVID19_Benchmarking

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

70 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ConVision Benchmark: A Contemporary Framework to Benchmark CNN and ViT Models

🧑‍💻 Authors: Shreyas Bangalore Vijayakumar, Krishna Teja Chitty-Venkata, Kanishk Arya and Arun K. Somani

🏣 Affiliation: Iowa State University, Ames, IA

This repository is the official implementation of "ConVision Benchmark" paper

Table of Contents

đź“Ś About

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown remarkable performance in computer vision tasks, including object detection and image recognition. These models have evolved significantly in architecture, efficiency, and versatility. Concurrently, deep-learning frameworks have diversified, with versions that often complicate reproducibility and unified benchmarking. We propose ConVision Benchmark, a comprehensive framework in PyTorch, to standardize the implementation and evaluation of state-of-the-art CNN and ViT models. This framework addresses common challenges such as version mismatches and inconsistent validation metrics. As a proof of concept, we performed an extensive benchmark analysis on a COVID-19 dataset, encompassing nearly 200 CNN and ViT models in which DenseNet-161 and MaxViT-Tiny achieved exceptional accuracy with a peak performance of around 95%. Although we primarily used the COVID-19 dataset for image classification, the framework is adaptable to a variety of datasets, enhancing its applicability across different domains. Our methodology includes rigorous performance evaluations, highlighting metrics such as accuracy, precision, recall, F1 score, and computational efficiency (FLOPs, MACs, CPU, and GPU latency). The ConVision Benchmark facilitates a comprehensive understanding of model efficacy, aiding researchers in deploying high-performance models for diverse applications.

llama-bench

đź“Ś Citation

If you find this repository useful, please consider citing our paper:

@article{bangalore2024convision,
  title={ConVision Benchmark: A Contemporary Framework to Benchmark CNN and ViT Models},
  author={Bangalore Vijayakumar, Shreyas and Chitty-Venkata, Krishna Teja and Arya, Kanishk and Somani, Arun K},
  journal={AI},
  volume={5},
  number={3},
  pages={1132--1171},
  year={2024},
  publisher={MDPI}
}

Features

We implemented CNN and ViT models in such a way a single model file contains the entire description of model without having to import any extra module (except torch and torchvision). For example, efficientnet_b0.py file contains the full code for Efficientnet_b0 model. One can directly use the file for any Computer vision application.

We referred to several open source repositories for CNN and ViT models

CNN Models

Models
AlexNet ConvNext_family
DenseNet_family EfficientNet_family
GhostNet_family Inception_Family
MNASNet_family Mobilenet_family
NFNet_Family RegNet_family
ResNet_family Shufflenet_family
Squeezenet_family Vgg_family

ViT Models

Models
BoTNet_Family CCT_family
CaiT_family CrossFormer_family
CrossViT_family CvT_family
DeepViT_family EdgeNeXt_Family
Efficientformer_family FocalTransformer_Family
GC_ViT_Family LVT_Family
LeViT_family MLP_Mixer_Family
Max_ViT_Family MobileFormer_Family
PVT_Family PiT_family
PoolFormer_Family Region_ViT_family
SepViT_family Swin_family
T2T_family TNT_family
Twins_family VAN_Family
Vision_transformer

Getting Started

Create a Virtual Environment

conda create -n ConVision python=3.11
conda activate ConVision

Install Requirements

pip install -r requirements.txt 

Curate the target Dataset in the following format:

├── Dataset

│   ├── Train

│   ├──   ├── Class 1
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n

│   ├──   ├── Class 2
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n

│   ├──   ├── Class m
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n


│   ├── Test
│   ├──   ├── Class 1
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n

│   ├──   ├── Class 2
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n

│   ├──   ├── Class m
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n


│   ├── Val
│   ├──   ├── Class 1
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n

│   ├──   ├── Class 2
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n

│   ├──   ├── Class m
│   ├──   ├──   ├── Image 1
│   ├──   ├──   ├── Image 2
│   ├──   ├──   ├── Image n

Example to run ResNet18 from ResNet_family in CNN_Models

python train/train.py \
            --model_name 'ResNet18' \
            --model_family 'ResNet_family' \
            --model_type 'CNN_Models' \
            --using_bn \
            --port 29500 \
            --model_dir '' \
            --train_root '' \
            --test_root '' \
            --val_root '' \
            --epochs 500 \
            --batch_size 256 \
            --workers 3 \
            --lr_mode 'cosine' \
            --base_lr 0.1 \
            --warmup_epochs 25 \
            --warmup_lr 0.0 \
            --targetlr 0.0 \
            --momentum 0.9 \
            --weight_decay 0.00005 \
            --using_moving_average \
            --last_gamma \
            --print_freq 100 \
            --evaluate \

Example to run CvT_13_Model from CvT_family in ViT_Models

python train/train.py \
            --model_name 'CvT_13_Model' \
            --model_family 'CvT_family' \
            --model_type 'ViT_Models' \
            --using_bn \
            --port 29500 \
            --model_dir '' \
            --train_root '' \
            --test_root '' \
            --val_root '' \
            --epochs 500 \
            --batch_size 256 \
            --workers 3 \
            --lr_mode 'cosine' \
            --base_lr 0.1 \
            --warmup_epochs 25 \
            --warmup_lr 0.0 \
            --targetlr 0.0 \
            --momentum 0.9 \
            --weight_decay 0.00005 \
            --using_moving_average \
            --last_gamma \
            --print_freq 100 \
            --evaluate \

Accuracy Results on COVID-19 Dataset

Model Top-1 Accuracy (%) Top-1 Accuracy Epoch Recall (%) Recall Epoch Best Loss Best Loss Epoch
DenseNet-161 95.61 410 0.96 410 0.21 226
Vgg-13-bn 95.27 230 0.95 230 0.16 39
DenseNet-121 95.26 401 0.95 401 0.23 204
DenseNet-169 95.24 327 0.95 327 0.2 12
Vgg-19-bn 95.08 187 0.95 187 0.18 24
MaxVit-tiny 95.02 296 0.95 296 0.17 26
Vgg-11-bn 95.01 293 0.95 293 0.17 27
Vgg-16-bn 94.99 131 0.95 131 0.2 28
GoogLeNet 94.95 136 0.95 136 0.25 16
DenseNet-201 94.9 383 0.95 383 0.24 11
Vgg-16 94.27 130 0.94 130 0.21 20
EfficientNet-b4 94.25 290 0.94 290 0.19 30
MobileFormer-508M 94.15 129 0.94 129 0.25 24
MobileFormer-294M 94.11 125 0.94 125 0.26 9
Vgg-11 94.08 154 0.94 154 0.22 12
EfficientNet-b0 94.03 253 0.94 253 0.2 26
Vgg-13 94 90 0.94 90 0.22 11
MobileNetV2 93.97 442 0.94 442 0.26 17
Vgg-19 93.89 176 0.94 176 0.2 17
EfficientNet-b1 93.74 371 0.94 371 0.22 31
EfficientNet-b3 93.64 154 0.94 154 0.21 31
MobileFormer-151M 93.62 134 0.94 134 0.23 16
AlexNet 93.62 267 0.94 267 0.22 19
MobileFormer-214M 93.58 444 0.94 444 0.25 22
EfficientNet-v2-s 93.41 361 0.93 361 0.22 28
CCT-14-sine 93.3 120 0.93 120 0.24 33
MobileNetV1 93.27 154 0.93 154 0.3 460
MobileFormer-96M 93.21 89 0.93 89 0.3 26
CCT-7-sine 93.12 170 0.93 170 0.27 30
ResNet18 93.05 170 0.93 170 0.29 443
EfficientNet-b2 93 257 0.93 257 0.27 26
RegNet-y-16gf 93 45 0.93 45 0.31 11
CCT-14 92.9 206 0.93 206 0.24 26
EfficientNet-b5 92.87 382 0.93 382 0.26 38
ShuffleNetV2-x0-5 92.86 278 0.93 278 0.23 25
GhostNetV2 92.81 160 0.93 160 0.29 9
ResNet34 92.81 271 0.93 271 0.3 8
ShuffleNetV2-x2-0 92.78 395 0.93 395 0.31 12
RegNet-y-800mf 92.77 58 0.93 58 0.31 16
RegNet-y-1-6gf 92.75 61 0.93 61 0.25 14
ShuffleNetV2-x1-5 92.75 236 0.93 236 0.31 5
RegNet-y-8gf 92.72 41 0.93 41 0.29 15
RegNet-y-3-2gf 92.72 68 0.93 68 0.31 11
ShuffleNetV2-x1-0 92.6 488 0.93 447 0.28 15
MobileFormer-26M 92.58 95 0.93 95 0.26 18
ResNext50 92.53 58 0.93 58 0.28 10
MobileNet-V3-large 92.47 422 0.92 422 0.36 61
PvT-v2-b3 92.43 56 0.92 56 0.24 19
RegNet-x-8gf 92.43 218 0.92 218 0.32 5
RegNet-x-1-6gf 92.38 254 0.92 254 0.33 467
RegNet-y-32gf 92.34 64 0.92 64 0.3 13
MobileFormer-52M 92.34 135 0.92 135 0.26 16
RegNet-y-400mf 92.32 110 0.92 110 0.3 23
RegNet-x-16gf 92.24 434 0.92 434 0.3 8
ResNet50 92.15 233 0.92 233 0.31 19
MNASnet-05 92.15 499 0.92 499 0.39 499
RegNet-x-400mf 92.13 437 0.92 437 0.33 22
RegNet-x-32gf 92.09 488 0.92 488 0.32 452
RegNet-x-3-2gf 92.02 107 0.92 92 0.32 15
FocalTransformer-Tiny 91.99 392 0.92 392 0.28 32
MNASnet-13 91.99 499 0.92 499 0.45 499
PvT-v2-b2 91.97 121 0.92 121 0.25 16
CCT-7 91.96 78 0.92 78 0.27 28
Swin-ViT-Tiny-window7 91.84 378 0.92 378 0.28 24
MNASnet-075 91.82 499 0.92 499 0.46 499
RegNet-x-800mf 91.81 401 0.92 401 0.33 14
FocalTransformer-Small 91.76 269 0.92 269 0.29 35
EfficientNet-v2-m 91.74 450 0.92 450 0.31 38
wide-ResNet50 91.74 286 0.92 286 0.27 13
mobilenet-v3-small 91.74 469 0.92 469 0.39 34
PvT-v2-b4 91.63 80 0.92 80 0.26 46
GCViT-xxTiny 91.62 349 0.92 349 0.26 29
Swin-ViT-Base 91.6 340 0.92 340 0.27 31
MNASnet-10 91.56 499 0.92 499 0.42 499
Swin-ViT-Small-window7 91.56 369 0.92 369 0.28 33
PvT-v2-b5 91.48 46 0.91 46 0.25 19
ResNet101 91.44 481 0.91 481 0.32 14
DeepViT-S 91.44 340 0.91 340 0.27 32
Swin-ViT-Large-window7 91.37 402 0.91 402 0.29 33
Swin-ViT-Small 91.34 437 0.91 437 0.28 31
PvT-v2-b2-Linear 91.32 54 0.91 54 0.29 26
VAN-b0 91.32 257 0.91 257 0.28 238
PvT-v2-b1 91.32 35 0.91 35 0.25 22
GCViT-xTiny 91.28 169 0.91 169 0.26 33
ResNext101 91.25 358 0.91 358 0.33 6
Swin-ViT-Base-window7 91.15 407 0.91 407 0.29 34
T2T-ViT-T-24 91.12 130 0.91 130 0.3 32
Swin-ViT-Tiny 91.07 389 0.91 389 0.26 29
T2T-ViT-19 90.95 169 0.91 169 0.3 32
GCViT-Tiny 90.93 111 0.91 111 0.27 24
wide-ResNet101 90.9 456 0.91 456 0.4 5
GCViT-Tiny2 90.87 81 0.91 81 0.25 31
ResNet152 90.81 108 0.91 108 0.37 17
T2T-ViT-14-wide 90.79 121 0.91 114 0.32 15
CrossFormer-small 90.72 232 0.91 232 0.29 24
T2T-ViT-14 90.69 133 0.91 133 0.29 31
LVT 90.65 60 0.91 60 0.28 28
NFNet-F0 90.65 20 0.91 20 0.27 20
CrossFormer-base 90.63 332 0.91 332 0.28 40
PvT-v2-b0 90.62 71 0.91 71 0.28 24
CrossFormer-large 90.6 192 0.91 192 0.27 30
DeepViT-L 90.54 214 0.91 214 0.3 87
CrossFormer-tiny 90.54 419 0.91 419 0.3 31
PvT-Large 90.48 150 0.9 150 0.31 32
T2T-ViT-10 90.34 245 0.9 245 0.3 32
RegionViT-Small 90.32 135 0.9 135 0.29 15
T2T-ViT-14-resnext 90.26 236 0.9 236 0.29 26
VAN-b1 90.17 64 0.9 64 0.31 11
Twins-PCPVT-Large 90.11 58 0.9 58 0.28 21
Twins-SVT-Base 90.04 52 0.9 52 0.31 18
Sep-ViT-Small 90.03 71 0.9 71 0.31 25
VAN-b2 89.97 43 0.9 43 0.33 14
T2T-ViT-24 89.95 221 0.9 210 0.32 28
Sep-ViT-Base 89.85 104 0.9 104 0.28 22
Twins-PCPVT-Base 89.81 57 0.9 57 0.31 18
T2T-ViT-7 89.81 191 0.9 191 0.29 39
PvT-Tiny 89.79 153 0.9 153 0.32 36
RegionViT-Base 89.7 140 0.9 140 0.3 25
Twins-SVT-Large 89.7 63 0.9 63 0.3 13
Sep-ViT-Tiny 89.7 56 0.9 56 0.29 23
PvT-Small 89.67 87 0.9 87 0.32 32
RegionViT-Medium 89.64 179 0.9 179 0.3 17
T2T-ViT-T-19 89.58 104 0.9 104 0.32 24
T2T-ViT-12 89.54 229 0.9 229 0.29 33
PiT-XS 89.5 97 0.89 97 0.33 21
RegionViT-Tiny 89.47 157 0.89 157 0.32 19
TNT-Base 89.41 215 0.89 215 0.32 38
Twins-SVT-Small 89.39 57 0.89 57 0.31 21
T2T-ViT-T-14 89.39 134 0.89 134 0.31 31
Twins-PCPVT-Small 89.3 49 0.89 49 0.29 18
CrossViT-15-dagger 89.25 236 0.89 236 0.32 31
PiT-Small 89.11 103 0.89 103 0.34 18
CrossViT-9-dagger 89.04 135 0.89 135 0.33 34
CvT-21 89.04 286 0.89 286 0.37 24
CrossViT-Small 88.85 254 0.89 254 0.32 41
CrossViT-15 88.77 207 0.89 207 0.33 36
PiT-TI 88.74 120 0.89 120 0.35 16
CrossViT-Base 88.74 179 0.89 179 0.34 33
EdgeNeXt-BNHS-Xsmall 88.74 54 0.89 54 0.3 38
TNT-Small 88.72 171 0.89 171 0.31 45
ConvNext-Small 88.67 59 0.89 59 0.32 59
CrossViT-18 88.57 265 0.89 265 0.33 41
ViT-Small-patch8 88.41 43 0.88 43 0.32 43
Sep-ViT-Lite 88.35 27 0.88 27 0.32 27
ViT-Small-patch16 88.29 204 0.88 204 0.34 37
EdgeNeXt-BNHS-Small 88.27 253 0.88 253 0.35 42
CvT-13 88.21 68 0.88 68 0.35 27
CaiT-XS24 88.16 108 0.88 108 0.33 80
CrossViT-Tiny 87.99 218 0.88 218 0.32 46
CaiT-XS36 87.95 84 0.88 84 0.33 72
CrossViT-9 87.82 200 0.88 200 0.33 41
ViT-Tiny-patch16 87.67 182 0.88 182 0.36 39
CaiT-XXS24 87.51 102 0.88 102 0.35 88
ConvNext-Tiny 87.48 68 0.87 68 0.35 68
CaiT-S24 87.46 74 0.87 74 0.34 59
MLPMixer 87.43 60 0.87 60 0.37 6
ViT-Base-patch16 87.29 186 0.87 186 0.36 34
ConvNext-Base 87.24 47 0.87 47 0.35 47
ResMLP 87.11 38 0.87 38 0.36 31
EdgeNeXt-Small 87.09 69 0.87 69 0.35 54
CaiT-XXS36 87.05 74 0.87 74 0.35 74
EdgeNeXt-Base 86.93 54 0.87 54 0.36 46
ViT-Large-patch32 86.73 167 0.87 167 0.4 33
EdgeNeXt-Xsmall 86.65 57 0.87 57 0.37 42
ViT-Base-patch32 86.34 155 0.86 154 0.41 36
ViT-Small-patch32 86.06 129 0.86 129 0.43 25
EdgeNeXt-BNHS-Xxsmall 85.92 58 0.86 58 0.4 38
GCViT-Small 85.75 65 0.86 65 0.39 32
PoolFormer-S36 85.22 484 0.85 484 0.39 442
SqueezeNet-1-1 85.11 463 0.63 445 0.59 437
PoolFormer-S12 83.91 37 0.84 37 0.42 37
NFNet-F1 83.44 10 0.83 10 0.44 10
BoTNet 83.38 294 0.83 294 0.45 151
PiT-Base 81.51 20 0.82 20 0.47 20
PoolFormer-M36 81.38 20 0.81 20 0.48 14
PoolFormer-S24 80.38 38 0.8 38 0.49 33
PvT-Medium 75.84 4 0.76 4 0.6 4
SqueezeNet-1-0 45.54 2 0.46 9 1.02 1
Inception-Resnet-v2 34.69 18 0.35 18 1.06 16

Computational Results on COVID-19 Dataset

Model Number of Parameters (Million) MACs (billion) FLOPs (billion) Training Time per Epoch (s) CPU Latency (ms) GPU Latency (ms) Training Memory (GB) Inference Memory (MB)
CrossViT-9 8.07 1.54 1.85 28.48 16.75 7.16 0.17 31.28
RegNet-x-1-6gf 8.28 1.63 1.62 19.06 17.01 4.7 0.19 31.98
CrossViT-9-dagger 8.29 1.68 1.99 29.22 17.61 7.43 0.18 32.14
MobileFormer-294M 9.51 0.29 0.29 23.48 21.14 15.82 0.21 36.69
PiT-XS 10.16 1.1 1.1 18.32 12.31 3.5 0.18 39.2
RegNet-y-1-6gf 10.32 1.65 1.63 21.78 24.06 10.79 0.22 39.87
EfficientNet-b3 10.69 0.97 0.95 36.41 24.44 8.72 0.31 41.52
ResNet18 11.18 1.82 1.82 17.69 9.42 1.57 0.19 42.7
PoolFormer-S12 11.38 1.81 1.82 17.48 17.63 3.36 0.23 44.42
GCViT-xxTiny 11.44 1.96 2.14 47.39 27.75 9.31 0.28 46.19
CaiT-XXS24 11.72 2.18 2.53 54.46 37.67 11.8 0.28 45.11
MobileFormer-508M 12.06 0.5 0.51 29.69 26.1 15.85 0.28 46.47
PvT-Tiny 12.33 1.86 1.94 27.91 18.65 4.34 0.26 48.7
DenseNet-169 12.49 3.43 3.4 35.04 46.82 14.04 0.34 48.48
RegionViT-Tiny 13.31 2.31 2.43 52.56 31.93 11.85 0.3 50.98
VAN-b1 13.34 2.51 2.5 35.84 29.37 5.48 0.32 51.57
PvT-v2-b1 13.5 2.04 2.12 45.28 25.29 4.76 0.29 51.56
RegNet-x-3-2gf 14.29 3.22 3.2 24.6 27.22 6.37 0.3 55.01
ResMLP 14.94 3.01 3.01 32 14.76 2.93 0.27 58.06
CaiT-XXS36 17.06 3.24 3.77 80.26 57.31 17.66 0.42 65.6
EfficientNet-b4 17.55 1.58 1.54 49.72 32.27 10.61 0.48 68.25
EdgeNeXt-Base 17.91 2.92 2.95 45.6 26.18 5.54 0.38 69.99
RegNet-y-3-2gf 17.93 3.22 3.2 28.75 30.39 8.31 0.37 69.61
DenseNet-201 18.1 4.39 4.34 43.93 61.95 17.11 0.46 70.19
BoTNet 18.8 4.02 4.06 23.98 28.94 4.86 0.37 72.1
GCViT-xTiny 19.42 2.71 2.94 62.8 37.31 12.06 0.43 76.8
CvT-13 19.61 4.08 4.58 58.61 39.41 11.54 0.4 75.07
EfficientNet-v2-s 20.18 2.9 2.88 32.62 36.04 11.8 0.46 79.03
PoolFormer-S24 20.84 3.39 3.41 23.6 33.95 6.59 0.42 80.66
T2T-ViT-T-14 21.08 4.35 6.11 74.47 38.4 5.92 0.44 80.8
T2T-ViT-14 21.08 4.35 4.8 55.18 28.44 6.35 0.4 80.81
T2T-ViT-14-resnext 21.08 4.35 4.8 87.15 30.72 6.4 0.46 80.81
ResNet34 21.29 3.68 3.67 17.28 17.02 2.62 0.35 82.18
ViT-Small-patch8 21.37 16.76 16.76 184.67 96.67 4.8 0.55 83.08
ViT-Small-patch16 21.59 4.25 4.25 41.85 23.29 3.11 0.38 83.09
NFNet-F0 21.86 0.02 9.18 38.27 105.73 10.76 1.69 263.07
CCT-14-sine 21.91 5.12 5.53 50.1 27.48 4.8 0.4 84.13
CCT-14 21.91 5.12 5.53 50.08 27.86 4.89 0.4 84.13
PvT-v2-b2-Linear 22.04 3.76 3.91 87.54 44.64 10.11 0.49 85.04
ViT-Small-patch32 22.48 1.12 1.12 17.68 10.95 3.16 0.35 85.94
PiT-Small 22.78 2.42 2.42 31.43 21.77 3.65 0.39 87.48
ResNext50 22.99 4.29 4.26 28.87 32.07 4.12 0.45 88.61
TNT-Small 23.3 4.85 5.24 118.45 43.93 10.97 0.49 91.95
ResNet50 23.51 4.13 4.11 22.31 27.27 3.95 0.43 89.95
Twins-SVT-Small 23.55 2.82 2.82 42.83 31.37 8.28 0.44 90.98
PvT-Small 23.58 3.69 3.83 49.5 33.56 8.27 0.48 91.69
Twins-PCPVT-Small 23.59 3.68 3.68 47.79 36.44 7.33 0.47 90.47
T2T-ViT-14-wide 24.23 4.97 5.24 52.51 23.82 3.17 0.42 93.39
PvT-v2-b2 24.85 3.9 4.05 80.47 43.29 9.24 0.53 94.93
VAN-b2 26.06 5.01 5 64.12 57.48 10.47 0.6 100.61
CrossViT-Small 26.13 5.08 5.63 65 33.22 7.91 0.5 100.72
CaiT-XS24 26.2 4.87 5.4 84.85 59.42 11.93 0.56 100.48
DenseNet-161 26.48 7.84 7.78 52.73 73.24 14.4 0.63 103.1
CrossViT-15-dagger 27.48 5.49 6.13 67.44 37.02 8.9 0.53 106.18
Swin-ViT-Tiny-window7 27.5 4.37 4.51 62.9 30.61 6.26 0.52 106.45
Swin-ViT-Base-window7 27.5 4.37 4.51 62.94 31.31 6.45 0.52 106.45
Swin-ViT-Tiny 27.5 4.38 4.64 65.37 32.72 5.9 0.54 108.14
GCViT-Tiny 27.58 4.32 4.79 90.34 59.13 19.18 0.62 112.12
ConvNext-Tiny 27.81 4.46 4.47 60.66 25.91 3.65 0.52 107.11
EfficientNet-b5 28.35 2.46 2.41 67.67 45 12.76 0.72 110.67
FocalTransformer-Tiny 29.44 4.66 5.22 146.14 65.75 17.82 0.62 116
RegionViT-Small 29.79 5.19 5.35 84.44 44.81 11.9 0.59 114.75
CrossFormer-small 29.89 4.79 4.92 63.15 37.44 9.12 0.56 114.42
PoolFormer-S36 30.29 4.97 5 33.99 50.58 9.5 0.62 116.89
MaxVit-tiny 30.38 5.46 5.61 91.16 64.36 17.1 0.75 118.93
Sep-ViT-Tiny 30.4 4.28 4.53 60.46 32.76 6.82 0.56 116.46
CvT-21 31.24 6.54 7.21 89.71 60.54 18.49 0.63 119.56
GCViT-Tiny2 33.84 5.56 6.21 110.91 67.88 23.66 0.76 139.19
RegNet-y-8gf 37.17 8.04 8 41.54 49.33 6.97 0.71 143.66
RegNet-x-8gf 37.66 8.05 8.02 33.09 46.26 6.53 0.67 144.42
CaiT-XS36 38.19 7.25 8.05 125.61 87.81 16.97 0.81 146.36
T2T-ViT-T-19 38.64 7.8 9.81 111.49 54.7 7.3 0.75 147.87
T2T-ViT-19 38.64 7.8 8.5 91.81 43.65 8.18 0.71 147.88
RegionViT-Medium 40.41 7.22 7.43 106.99 55.75 16.46 0.76 155.33
CrossViT-18 42.42 8.21 9.05 99.93 48.94 9.79 0.78 163.05
ResNet101 42.51 7.86 7.83 33.93 49.66 7.91 0.75 162.89
PvT-Medium 43.31 6.46 6.69 76.58 53.59 14.11 0.82 167.01
Twins-PCPVT-Base 43.32 6.46 6.46 72.79 57.61 12.48 0.8 165.82
NFNet-F1 43.7 0.04 16.92 66.52 189.01 20.18 3.17 500.84
PvT-v2-b3 44.73 6.7 6.92 116.41 68.55 15.47 0.88 170.85
Sep-ViT-Small 45.78 7.07 7.48 93.93 50.93 11.43 0.84 176.7
CaiT-S24 46.44 8.63 9.35 118.42 80.53 12.09 0.91 177.99
Swin-ViT-Small-window7 48.79 8.54 8.77 107.99 56.78 12.5 0.9 188.27
Swin-ViT-Large-window7 48.79 8.55 8.77 125.08 58.6 12.65 0.92 188.46
Swin-ViT-Small 48.79 8.58 9.43 121.63 64.03 10.84 0.98 196.6
ConvNext-Small 49.44 8.7 8.7 108.96 45.36 6.8 0.91 189.73
GCViT-Small 50.11 7.87 8.57 135.43 73.2 18.98 1.05 199.24
FocalTransformer-Small 50.74 8.87 9.75 244.35 115.78 35.54 1.07 201.32
CrossFormer-base 51.2 8.96 9.19 107.93 63.41 17.06 0.94 196.1
RegNet-x-16gf 52.24 16.04 15.99 53.72 74.41 6.54 0.98 202.09
EfficientNet-v2-m 52.86 5.45 5.41 53.7 65.57 16.43 1.05 207.29
Inception-Resnet-v2 54.31 6.5 6.48 39 82.18 19.93 0.91 210.51
Twins-SVT-Base 55.3 8.36 8.36 92.75 58.72 11.19 0.99 214.47
PoolFormer-M36 55.32 8.76 8.8 48 75.19 9.47 1.07 211.63
AlexNet 57.02 0.71 0.71 17.03 8.95 0.51 0.85 217.5
ResNet152 58.15 11.6 11.56 46.75 72.23 12.03 1.04 222.54
DeepViT-L 58.38 12.16 13.19 184.18 82.44 13.83 1.12 223.34
PvT-Large 60.47 9.53 9.85 109 79.53 20.33 1.14 232.44
Twins-PCPVT-Large 60.48 9.52 9.53 102.84 82.13 18.06 1.12 231.77
PvT-v2-b4 62.04 9.82 10.14 165.58 97.98 23.3 1.22 237.76
T2T-ViT-T-24 63.49 12.7 15 154.36 71.54 9.34 1.17 243.59
T2T-ViT-24 63.49 12.7 13.69 134.95 63.92 10.35 1.13 243.6
TNT-Base 64.64 13.44 14.09 196.07 80.53 11.11 1.19 247.23
wide-ResNet50 66.84 11.45 11.42 33.16 60.34 4.02 1.11 257.28
Swin-ViT-Base 70.09 12.75 13.69 164.92 89.29 15.83 1.36 278.57
RegionViT-Base 71.67 12.79 13.07 161.49 85.48 16.52 1.3 276.61
PiT-Base 72.5 10.55 10.56 105.05 59.57 3.98 1.22 279.65
RegNet-y-16gf 80.57 16.01 15.96 58.87 83.78 8.36 1.42 316.53
Sep-ViT-Base 81.33 12.54 13.08 144.88 73.92 11.35 1.42 312.46
PvT-v2-b5 81.44 11.38 11.76 170.68 109.8 29.59 1.5 311.92
ViT-Base-patch16 85.65 16.86 16.87 136.99 65.71 4.78 1.39 327.43
ResNext101 86.75 16.54 16.47 69.47 95.97 7.93 1.55 336.15
ViT-Base-patch32 87.42 4.37 4.37 35.79 27.77 3.3 1.33 333.75
ConvNext-Base 87.55 15.37 15.38 171.72 73.34 6.8 1.53 334.24
CrossFormer-large 90.95 15.85 16.15 168.25 90.3 16.84 1.61 351.68
Twins-SVT-Large 98.25 14.83 14.84 146.98 85.83 11.1 1.69 379.97
CrossViT-Base 103.57 20.13 21.22 188.87 90.13 7.96 1.79 407.83
RegNet-x-32gf 105.3 31.88 31.81 100.28 132.68 6.82 1.85 402.73
wide-ResNet101 124.84 22.84 22.79 55.16 107.95 8.07 2.05 484.85
Vgg-11 128.78 7.61 7.61 20.46 40.37 0.95 1.97 491.25
Vgg-11-bn 128.78 7.63 7.62 25.71 40.52 1.08 1.99 491.3
Vgg-13 128.96 11.3 11.3 30.26 49.86 1.08 1.99 491.96
Vgg-13-bn 128.97 11.35 11.33 39.61 50.98 1.26 2.03 492.01
Vgg-16 134.27 15.47 15.47 35.48 60.87 1.3 2.07 513.09
Vgg-16-bn 134.28 15.52 15.49 45.63 62.06 1.52 2.12 513.16
Vgg-19 139.58 19.63 19.63 41.03 73.43 1.52 2.16 532.47
Vgg-19-bn 139.59 19.69 19.66 51.7 73.04 1.71 2.21 532.56
RegNet-y-32gf 141.34 32.4 32.34 88.81 146.35 8.37 2.41 551.99
ViT-Large-patch32 305.46 15.26 15.27 114.15 90.16 6.2 4.63 1165.68

Acknowledgement

About

ConVision Benchmark: A Contemporary Framework to Benchmark CNN and ViT Models

Resources

Stars

Watchers

Forks

Releases

No releases published

Languages