An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale

Contents

Summary

In computer vision, attention is either applied in conjunction with convolutional networks or used to replace certain components of convolutional networks while keeping their overall structure in place.

Approach

The image is split into patches, and the sequence of linear embeddings of these patches is provided as input to a Transformer. Image patches are treated the same way as tokens in NLP.

When pre-trained at large scale and transferred to smaller tasks, ViT approaches state-of-the-art results on image recognition benchmarks.

Architecture

The model design follows the original Transformer.

Layout

Figure: Vision Transformer architecture overview.

The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, the image $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.

The Transformer uses a constant latent vector size $D$ through all of its layers, so the flattened patches are mapped to $D$ dimensions with a trainable linear projection. The outputs of this projection are the patch embeddings.

A learnable embedding is prepended to the sequence of embedded patches ($z_0^0 = x_\text{class}$), whose state at the output of the Transformer encoder ($z_L^0$) serves as the image representation $y$. During both pre-training and fine-tuning, a classification head is attached to $z_L^0$. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

Standard learnable 1D position embeddings are added to the patch embeddings to retain positional information. The resulting sequence of embedding vectors serves as input to the encoder.
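
The following is a minimal PyTorch sketch of the steps above (patch splitting, linear projection, class token, position embeddings). The module name `PatchEmbedding` and the ViT-Base-style defaults (`patch_size=16`, `embed_dim=768`) are illustrative assumptions, not the repository's implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, flatten them, and project to D dimensions.
    A learnable class token is prepended and learnable 1D position embeddings are added."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # N = HW / P^2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared trainable linear projection E.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                     # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))  # E_pos

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend class token -> (B, N+1, D)
        return x + self.pos_embed              # add position embeddings
```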

The Transformer encoder consists of alternating layers of multi-headed self-attention (MSA) and MLP blocks, with LayerNorm (LN) applied before every block and residual connections after every block. The MLP contains two layers with a GELU non-linearity.

$$z_0 = [x_\text{class};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E] + E_{pos}, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \tag{1}$$

$$z'_\ell = \text{MSA}(\text{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1 \ldots L \tag{2}$$

$$z_\ell = \text{MLP}(\text{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1 \ldots L \tag{3}$$

$$y = \text{LN}(z_L^0) \tag{4}$$
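
A hedged PyTorch sketch of a single encoder layer corresponding to Eq. (2)–(3), using `nn.MultiheadAttention` for MSA; the class name and hyperparameter defaults (12 heads, MLP ratio 4) are assumptions, not the repository's implementation:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder layer: pre-LayerNorm multi-head self-attention and a
    two-layer MLP with GELU, each wrapped in a residual connection (Eq. 2-3)."""

    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden_dim = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, z):                                  # z: (B, N+1, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                    # z  = MLP(LN(z')) + z'
        return z
```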

Inductive Bias

The Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, the two-dimensional neighbourhood structure and translation equivariance are built into each layer throughout the whole model, whereas in ViT only the MLP layers are local and translationally equivariant, while the self-attention layers are global.

Fine-Tuning

The ViT is pre-trained on large datasets and then fine-tuned to downstream tasks. For this, the pre-trained prediction head is removed and a zero-initialized feed-forward layer is attached. When feeding images of higher resolution, the patch size is kept the same, which results in a larger effective sequence length; the pre-trained position embeddings are therefore adapted via 2D interpolation according to their location in the original image.
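
A minimal sketch of how that 2D interpolation could be done, assuming the position embeddings are stored as a `(1, N+1, D)` tensor with the class-token embedding first; the helper name `interpolate_pos_embed` and the choice of bicubic interpolation are assumptions:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid_size):
    """2D-interpolate pre-trained position embeddings so a model trained at one
    resolution can be fine-tuned on higher-resolution images (longer patch sequences).

    pos_embed: (1, N+1, D) -- class-token embedding followed by N patch embeddings.
    new_grid_size: number of patches per side at the new resolution (H'/P).
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid_size = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # Reshape to a 2D grid, interpolate spatially, then flatten back to a sequence.
    patch_pos = patch_pos.reshape(1, old_grid_size, old_grid_size, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid_size, new_grid_size),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid_size * new_grid_size, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```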

Implementation