Primary contact: Yitian Zhang
Background: ViT variants share the same architecture and differ only in embedding dimension, so a large ViT can be transformed to represent smaller models by uniformly slicing the weight matrix at each layer, e.g., ViT-B (r=0.5) equals ViT-S.
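As a concrete illustration, uniform slicing keeps the first fraction r of rows and columns of every weight matrix. Below is a minimal PyTorch sketch; `slice_linear` is our hypothetical helper for illustration, not the repo's API:

```python
import torch
import torch.nn as nn

def slice_linear(layer: nn.Linear, r: float) -> nn.Linear:
    # Uniform slicing: keep the first r fraction of output/input channels.
    out_f, in_f = int(layer.out_features * r), int(layer.in_features * r)
    sliced = nn.Linear(in_f, out_f, bias=layer.bias is not None)
    with torch.no_grad():
        sliced.weight.copy_(layer.weight[:out_f, :in_f])
        if layer.bias is not None:
            sliced.bias.copy_(layer.bias[:out_f])
    return sliced

# e.g., slicing a ViT-B-sized projection (dim 768) at r=0.5
# yields a ViT-S-sized one (dim 384).
full = nn.Linear(768, 768)
small = slice_linear(full, 0.5)  # 384 -> 384
```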
Target: a broad slicing bound to ensure the diversity of sub-networks; a fine-grained slicing granularity to ensure a sufficient number of sub-networks; uniform slicing to align with the inherent design of ViTs, which vary in width.
Contribution:
- (1) A detailed analysis of the slimmable ability of different architectures
- (2) Scala, which learns slimmable representations for flexible inference
- python 3.7
- pytorch 1.8.1
- torchvision 0.9.1
- timm 0.3.2
Please follow the instructions of DeiT to prepare the ImageNet-1K dataset.
Here we provide pretrained Scala models built on top of DeiT-S, trained on ImageNet-1K for 100 epochs:
Model | Acc1. (r=0.25) | Acc1. (r=0.5) | Acc1. (r=0.75) | Acc1. (r=1.0) |
---|---|---|---|---|
Separate Training | 45.8% | 65.1% | 70.7% | 75.0% |
Scala-S (X=25) | 58.4% | 67.8% | 73.1% | 76.2% |
Scala-S (X=13) | 58.7% | 68.3% | 73.3% | 76.1% |
Scala-S (X=7) | 59.8% | 70.3% | 74.2% | 76.5% |
Scala-S (X=4) | 59.8% | 72.0% | 75.6% | 76.7% |
We also provide Scala built on top of DeiT-B, trained on ImageNet-1K for 300 epochs:
Model | Acc1. (r=0.25) | Acc1. (r=0.5) | Acc1. (r=0.75) | Acc1. (r=1.0) |
---|---|---|---|---|
Separate Training | 72.2% | 79.9% | 81.0% | 81.8% |
Scala-B (X=13) | 75.3% | 79.3% | 81.2% | 82.0% |
Scala-B (X=7) | 75.3% | 79.7% | 81.4% | 82.0% |
Scala-B (X=4) | 75.6% | 80.9% | 81.9% | 82.2% |
- Slicing Granularity and Bound
- Application on Hybrid and Lightweight structures
- Slimmable Ability across Architectures
Transferability
- Can the slimmable representation be transferred to downstream tasks? We first pre-train on ImageNet-1K for 300 epochs and then conduct linear probing on the video recognition dataset UCF101. We make the classification head slimmable as well to fit features of various dimensions (see the head sketch after this list), and the results demonstrate the strong transferability of the slimmable representation.
- Can the generalization ability be maintained in the slimmable representation? When leveraging the vision foundation model DINOv2 as the teacher network, we follow the prior work Proteus and remove all Cross-Entropy losses during training (see the loss sketch below) to alleviate the dataset bias issue and inherit the strong generalization ability of the teacher network. The results are shown in the table, and the delivered Scala-B with strong generalization ability can be downloaded from the link.
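To make the classification head slimmable for linear probing, a single full-width linear layer can have its input weights sliced to match the feature dimension of the current subnet. A minimal sketch (the class name `SlimmableHead` is our own; the repo's implementation may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableHead(nn.Module):
    # One full-width classifier whose input weights are sliced to match
    # the feature dimension produced by the currently selected subnet.
    def __init__(self, max_dim: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, max_dim))
        self.bias = nn.Parameter(torch.zeros(num_classes))
        nn.init.trunc_normal_(self.weight, std=0.02)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        d = feat.shape[-1]  # width of the current subnet's features
        return F.linear(feat, self.weight[:, :d], self.bias)
```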
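For the DINOv2-distilled variant, removing the Cross-Entropy terms leaves a distillation-only objective, which can be sketched as below (illustrative only; the exact losses used by Proteus and Scala may differ):

```python
import torch.nn.functional as F

def distillation_only_loss(student_feat, teacher_feat):
    # No Cross-Entropy on ImageNet labels: the student only mimics the
    # teacher's (DINOv2) features, which alleviates dataset bias.
    return F.smooth_l1_loss(student_feat, teacher_feat)
```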
- Specify the directory of datasets with `IMAGENET_LOCATION` in `run.sh`.
- Specify the smallest slicing bound $s$ (`smallest_ratio`), the largest slicing bound $l$ (`largest_ratio`), and the slicing granularity $\epsilon$ (`granularity`) to determine $X$, the number of subnets, where $X = (l - s)/\epsilon + 1$:
$s$ | $l$ | $\epsilon$ | $X$ |
---|---|---|---|
0.25 | 1.0 | 0.03125 | 25 |
0.25 | 1.0 | 0.0625 | 13 |
0.25 | 1.0 | 0.125 | 7 |
0.25 | 1.0 | 0.25 | 4 |
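The relation between the bounds, the granularity, and $X$ can be verified with a tiny helper (illustrative; `num_subnets` is not part of the released scripts):

```python
def num_subnets(s: float, l: float, eps: float) -> int:
    # X = (l - s) / eps + 1, counting both endpoints of the ratio range.
    return round((l - s) / eps) + 1

assert num_subnets(0.25, 1.0, 0.03125) == 25
assert num_subnets(0.25, 1.0, 0.25) == 4
```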
- Run `bash run.sh`.
- Specify the directory of datasets with `IMAGENET_LOCATION` and the pretrained model with `MODEL_PATH` in `eval.sh`.
- Specify the width ratio with `eval_ratio` in `eval.sh`.
- Run `bash eval.sh`.