Commit 56984ed: init

Bin Xiao committed May 25, 2021
1 parent f42d58b
Showing 35 changed files with 14,270 additions and 33 deletions.
130 changes: 122 additions & 8 deletions README.md
# Introduction
This is an official implementation of [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808). We present a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformers (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (e.g. shift, scale, and distortion invariance) while maintaining the merits of Transformers (e.g. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher-resolution vision tasks.

![](figures/pipeline.svg)
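
At the core of the design is the convolutional projection: queries, keys, and values are produced by a depthwise convolution followed by batch normalization (the `dw_bn` method in the configs below), and keys/values can be computed at stride 2 to shrink the attention cost. Below is a minimal PyTorch sketch of the idea, not the repo's exact module:

``` python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Sketch of a 'dw_bn' convolutional projection.

    A depthwise conv + BatchNorm replaces the linear Q/K/V projection,
    injecting local spatial context; keys/values may use stride 2 (see
    STRIDE_KV in the configs) to subsample tokens and cut attention cost.
    """
    def __init__(self, dim: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, stride,
                      padding=kernel_size // 2, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) token map -> (B, N, C) token sequence
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)

# e.g. queries at stride 1, keys/values at stride 2 (per the configs)
q = ConvProjection(64, 3, 1)(torch.randn(2, 64, 56, 56))   # (2, 3136, 64)
kv = ConvProjection(64, 3, 2)(torch.randn(2, 64, 56, 56))  # (2, 784, 64)
```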

# Main results
## Models pre-trained on ImageNet-1k
| Model | Resolution | Param | GFLOPs | Top-1 |
|--------|------------|-------|--------|-------|
| CvT-13 | 224x224 | 20M | 4.5 | 81.6 |
| CvT-21 | 224x224 | 32M | 7.1 | 82.5 |
| CvT-13 | 384x384 | 20M | 16.3 | 83.0 |
| CvT-21 | 384x384    | 32M   | 24.9   | 83.3  |

## Models pre-trained on ImageNet-22k
| Model | Resolution | Param | GFLOPs | Top-1 |
|---------|------------|-------|--------|-------|
| CvT-13 | 384x384 | 20M | 16.3 | 83.3 |
| CvT-21  | 384x384    | 32M   | 24.9   | 84.9  |
| CvT-W24 | 384x384    | 277M  | 193.2  | 87.7  |

You can download all the models from our [model zoo](https://1drv.ms/u/s!AhIXJn_J-blW9RzF3rMW7SsLHa8h?e=blQ0Al).
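
If you want to inspect a downloaded checkpoint before using it, here is a hedged sketch; the file name is a placeholder, and whether the zoo files store a bare `state_dict` or a wrapping dict is an assumption:

``` python
import torch

# Placeholder name; substitute the file you downloaded from the model zoo.
ckpt = torch.load('CvT-13-224x224-IN-1k.pth', map_location='cpu')
state = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
n_params = sum(v.numel() for v in state.values())
print(f'{len(state)} tensors, {n_params / 1e6:.1f}M parameters')
```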


# Quick start
## Installation
Assuming you have installed PyTorch and TorchVision (if not, please follow the [official instructions](https://pytorch.org/) to install them first), install the remaining dependencies:

``` sh
python -m pip install -r requirements.txt --user -q
```

The code is developed and tested with PyTorch 1.7.1. Other PyTorch versions are not fully tested.
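
A quick environment sanity check, for example:

``` python
import torch, torchvision

print(torch.__version__, torchvision.__version__)  # developed against torch 1.7.1
print('CUDA available:', torch.cuda.is_available())
```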

## Data preparation
Please prepare the data as follows:

``` sh
|-DATASET
  |-imagenet
    |-train
    |  |-class1
    |  |  |-img1.jpg
    |  |  |-img2.jpg
    |  |  |-...
    |  |-class2
    |  |  |-img3.jpg
    |  |  |-...
    |  |-class3
    |  |  |-img4.jpg
    |  |  |-...
    |  |-...
    |-val
    |  |-class1
    |  |  |-img5.jpg
    |  |  |-...
    |  |-class2
    |  |  |-img6.jpg
    |  |  |-...
    |  |-class3
    |  |  |-img7.jpg
    |  |  |-...
    |  |-...
```
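
Since this is the standard class-per-folder layout, you can sanity-check it with torchvision's `ImageFolder` (a quick check only; the repo's own dataloader is configured via the yaml files):

``` python
from torchvision import datasets, transforms

# Paths follow DATASET.ROOT / TRAIN_SET / TEST_SET in the configs.
train = datasets.ImageFolder('DATASET/imagenet/train',
                             transform=transforms.ToTensor())
val = datasets.ImageFolder('DATASET/imagenet/val',
                           transform=transforms.ToTensor())
print(len(train.classes), 'classes;', len(train), 'train /', len(val), 'val images')
```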


## Run
Each experiment is defined by a yaml config file saved under the `experiments` directory, which has a tree structure like this:

``` sh
experiments
|-{DATASET_A}
| |-{ARCH_A}
| |-{ARCH_B}
|-{DATASET_B}
| |-{ARCH_A}
| |-{ARCH_B}
|-{DATASET_C}
| |-{ARCH_A}
| |-{ARCH_B}
|-...
```

We provide a `run.sh` script for running jobs on a local machine.

``` sh
Usage: run.sh [run_options]
Options:
-g|--gpus <1> - number of gpus to be used
-t|--job-type <aml> - job type (train|test)
-p|--port <9000> - master port
-i|--install-deps - whether to install dependencies (default: false)
```

### Training on local machine

``` sh
bash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml
```

You can also modify the config parameters from the command line. For example, to change the learning rate to 0.1, run:
``` sh
bash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TRAIN.LR 0.1
```
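
These trailing `KEY VALUE` pairs follow the usual yacs convention; here is a minimal sketch of the merge mechanism, assuming the repo uses a yacs-style `CfgNode` (an assumption on our part):

``` python
from yacs.config import CfgNode as CN

cfg = CN()
cfg.TRAIN = CN()
cfg.TRAIN.LR = 0.00025  # default from the yaml

# Extra command-line tokens are merged as alternating KEY VALUE pairs;
# string values are coerced to the type of the existing default.
cfg.merge_from_list(['TRAIN.LR', '0.1'])
print(cfg.TRAIN.LR)  # 0.1
```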

Notes:
- The checkpoint, model, and log files will be saved under `OUTPUT/{dataset}/{training config}` by default.

### Testing pre-trained models

``` sh
bash run.sh -t test --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TEST.MODEL_FILE ${PRETRAINED_MODEL_FILE}
```

# Citation
If you find this work or code helpful in your research, please cite:

```
@article{wu2021cvt,
  title={{CvT}: Introducing Convolutions to Vision Transformers},
  author={Wu, Haiping and Xiao, Bin and Codella, Noel and Liu, Mengchen and Dai, Xiyang and Yuan, Lu and Zhang, Lei},
  journal={arXiv preprint arXiv:2103.15808},
  year={2021}
}
```
## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA).
25 changes: 0 additions & 25 deletions SUPPORT.md

This file was deleted.

83 changes: 83 additions & 0 deletions experiments/imagenet/cvt/cvt-13-224x224.yaml
``` yaml
OUTPUT_DIR: 'OUTPUT/'
WORKERS: 6
PRINT_FREQ: 500
AMP:
  ENABLED: true

MODEL:
  NAME: cls_cvt
  SPEC:
    INIT: 'trunc_norm'
    NUM_STAGES: 3
    PATCH_SIZE: [7, 3, 3]
    PATCH_STRIDE: [4, 2, 2]
    PATCH_PADDING: [2, 1, 1]
    DIM_EMBED: [64, 192, 384]
    NUM_HEADS: [1, 3, 6]
    DEPTH: [1, 2, 10]
    MLP_RATIO: [4.0, 4.0, 4.0]
    ATTN_DROP_RATE: [0.0, 0.0, 0.0]
    DROP_RATE: [0.0, 0.0, 0.0]
    DROP_PATH_RATE: [0.0, 0.0, 0.1]
    QKV_BIAS: [True, True, True]
    CLS_TOKEN: [False, False, True]
    POS_EMBED: [False, False, False]
    QKV_PROJ_METHOD: ['dw_bn', 'dw_bn', 'dw_bn']
    KERNEL_QKV: [3, 3, 3]
    PADDING_KV: [1, 1, 1]
    STRIDE_KV: [2, 2, 2]
    PADDING_Q: [1, 1, 1]
    STRIDE_Q: [1, 1, 1]
AUG:
  MIXUP_PROB: 1.0
  MIXUP: 0.8
  MIXCUT: 1.0
  TIMM_AUG:
    USE_LOADER: true
    RE_COUNT: 1
    RE_MODE: pixel
    RE_SPLIT: false
    RE_PROB: 0.25
    AUTO_AUGMENT: rand-m9-mstd0.5-inc1
    HFLIP: 0.5
    VFLIP: 0.0
    COLOR_JITTER: 0.4
    INTERPOLATION: bicubic
LOSS:
  LABEL_SMOOTHING: 0.1
CUDNN:
  BENCHMARK: true
  DETERMINISTIC: false
  ENABLED: true
DATASET:
  DATASET: 'imagenet'
  DATA_FORMAT: 'jpg'
  ROOT: 'DATASET/imagenet/'
  TEST_SET: 'val'
  TRAIN_SET: 'train'
TEST:
  BATCH_SIZE_PER_GPU: 32
  IMAGE_SIZE: [224, 224]
  MODEL_FILE: ''
  INTERPOLATION: 3
TRAIN:
  BATCH_SIZE_PER_GPU: 256
  LR: 0.00025
  IMAGE_SIZE: [224, 224]
  BEGIN_EPOCH: 0
  END_EPOCH: 300
  LR_SCHEDULER:
    METHOD: 'timm'
    ARGS:
      sched: 'cosine'
      warmup_epochs: 5
      warmup_lr: 0.000001
      min_lr: 0.00001
      cooldown_epochs: 10
      decay_rate: 0.1
  OPTIMIZER: adamW
  WD: 0.05
  WITHOUT_WD_LIST: ['bn', 'bias', 'ln']
  SHUFFLE: true
DEBUG:
  DEBUG: false
```
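
Two details worth noting in this config: `DEPTH: [1, 2, 10]` sums to 13 Transformer blocks, which is where the name CvT-13 comes from, and `POS_EMBED: [False, False, False]` confirms that positional embeddings are dropped entirely. A quick way to poke at the file (assuming PyYAML is installed):

``` python
import yaml

with open('experiments/imagenet/cvt/cvt-13-224x224.yaml') as f:
    cfg = yaml.safe_load(f)

spec = cfg['MODEL']['SPEC']
print(sum(spec['DEPTH']))      # 13 blocks -> "CvT-13"
print(any(spec['POS_EMBED']))  # False: no positional embeddings
```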
84 changes: 84 additions & 0 deletions experiments/imagenet/cvt/cvt-13-384x384.yaml
``` yaml
OUTPUT_DIR: 'OUTPUT/'
WORKERS: 6
PRINT_FREQ: 500
AMP:
  ENABLED: true

MODEL:
  NAME: cls_cvt
  SPEC:
    INIT: 'trunc_norm'
    NUM_STAGES: 3
    PATCH_SIZE: [7, 3, 3]
    PATCH_STRIDE: [4, 2, 2]
    PATCH_PADDING: [2, 1, 1]
    DIM_EMBED: [64, 192, 384]
    NUM_HEADS: [1, 3, 6]
    DEPTH: [1, 2, 10]
    MLP_RATIO: [4.0, 4.0, 4.0]
    ATTN_DROP_RATE: [0.0, 0.0, 0.0]
    DROP_RATE: [0.0, 0.0, 0.0]
    DROP_PATH_RATE: [0.0, 0.0, 0.1]
    QKV_BIAS: [True, True, True]
    CLS_TOKEN: [False, False, True]
    POS_EMBED: [False, False, False]
    QKV_PROJ_METHOD: ['dw_bn', 'dw_bn', 'dw_bn']
    KERNEL_QKV: [3, 3, 3]
    PADDING_KV: [1, 1, 1]
    STRIDE_KV: [2, 2, 2]
    PADDING_Q: [1, 1, 1]
    STRIDE_Q: [1, 1, 1]
AUG:
  MIXUP_PROB: 1.0
  MIXUP: 0.8
  MIXCUT: 1.0
  TIMM_AUG:
    USE_LOADER: true
    RE_COUNT: 1
    RE_MODE: pixel
    RE_SPLIT: false
    RE_PROB: 0.25
    AUTO_AUGMENT: rand-m9-mstd0.5-inc1
    HFLIP: 0.5
    VFLIP: 0.0
    COLOR_JITTER: 0.4
    INTERPOLATION: bicubic
LOSS:
  LABEL_SMOOTHING: 0.1
CUDNN:
  BENCHMARK: true
  DETERMINISTIC: false
  ENABLED: true
DATASET:
  DATASET: 'imagenet'
  DATA_FORMAT: 'jpg'
  ROOT: 'DATASET/imagenet/'
  TEST_SET: 'val'
  TRAIN_SET: 'train'
TEST:
  BATCH_SIZE_PER_GPU: 32
  IMAGE_SIZE: [384, 384]
  CENTER_CROP: False
  MODEL_FILE: ''
  INTERPOLATION: 3
TRAIN:
  BATCH_SIZE_PER_GPU: 256
  LR: 0.00025
  IMAGE_SIZE: [384, 384]
  BEGIN_EPOCH: 0
  END_EPOCH: 300
  LR_SCHEDULER:
    METHOD: 'timm'
    ARGS:
      sched: 'cosine'
      warmup_epochs: 5
      warmup_lr: 0.000001
      min_lr: 0.00001
      cooldown_epochs: 10
      decay_rate: 0.1
  OPTIMIZER: adamW
  WD: 0.05
  WITHOUT_WD_LIST: ['bn', 'bias', 'ln']
  SHUFFLE: true
DEBUG:
  DEBUG: false
```