Amalkrishna A S B200729CS, Emanuel Christo B200715CS

1. Introduction

Transformers have become the state-of-the-art models for many NLP tasks, including text generation, sequence tagging, and summarization. However, in the field of computer vision, convolutional neural networks such as ResNet have been the de facto standard for many years.

Vision Transformers (ViTs) apply the same transformer architecture to computer vision tasks efficiently and with minimal modifications. In our project, we mainly explore the use of ViTs for image classification. We further explore video classification using ViTs.

2. Literature Survey

2.1 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

2.1.1 Introduction

While there have been previous attempts at using transformers for vision tasks, most of them were inefficient and did not scale. Naively applying self-attention to images would require every pixel to attend to every other pixel, giving a cost that grows quadratically with the number of pixels, so it is not a scalable option. The model most similar to ViT was proposed by Cordonnier et al. in 2020; it uses 2x2 image patches and is therefore suitable only for low-resolution images. As we will see in section 2.1.3, ViT uses larger 16x16 image patches.
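
To make the scaling argument concrete, here is a back-of-the-envelope count for a standard 224x224 input (our own illustration, not a figure from the paper):

$$
224 \times 224 = 50{,}176 \ \text{pixels} \;\Rightarrow\; 50{,}176^2 \approx 2.5 \times 10^{9} \ \text{attention pairs}, \qquad \left(\tfrac{224}{16}\right)^2 = 196 \ \text{patches} \;\Rightarrow\; 196^2 = 38{,}416 \ \text{pairs}
$$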

2.1.2 Key idea

The primary idea behind the model is to split the image into fixed-size patches, compute a linear embedding for each patch, and feed the resulting sequence into a transformer encoder, just as is done with tokenised sentences. Each image patch plays the role of a word embedding in a sentence. A position embedding is added to each patch so that the model can learn spatial information and global image context.
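
A minimal sketch of the patch-extraction step (our own illustration, assuming a 224x224 RGB image and 16x16 patches as in ViT-Base; the tensor names are ours, not the paper's):

```python
import torch

H = W = 224                 # image resolution
C, P = 3, 16                # channels, patch size
N = (H // P) * (W // P)     # number of patches: 14 * 14 = 196

img = torch.randn(C, H, W)

# Cut the image into non-overlapping P x P patches and flatten each one,
# giving a sequence of N "visual words" of dimension P*P*C = 768.
patches = img.unfold(1, P, P).unfold(2, P, P)                    # (C, 14, 14, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(N, P * P * C)   # (196, 768)
```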

2.1.3 Methodology and Architecture

[Figure: Overview of the ViT architecture]

The overall ViT architecture, as illustrated in the figure, uses a standard transformer encoder together with a block that generates embeddings from the input image. Positional embeddings are learnable parameters added to each patch embedding before the sequence is fed to the encoder. Another learnable parameter, the class embedding, is prepended to the sequence to capture the global context of the image.
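
A rough sketch of this data flow in PyTorch (our own illustration; `nn.TransformerEncoder` is used as a generic stand-in for the paper's pre-norm encoder, and `z0` is the embedded sequence constructed as in section 2.1.4 below):

```python
import torch
import torch.nn as nn

N, D, num_classes = 196, 768, 1000     # patches, model width, ImageNet classes

z0 = torch.randn(1, N + 1, D)          # (batch, N + 1, D): class token + patch embeddings

# Generic transformer encoder as a stand-in for the ViT encoder
# (the paper places LayerNorm before attention/MLP; this layer uses post-norm).
layer = nn.TransformerEncoderLayer(d_model=D, nhead=12, dim_feedforward=3072,
                                   activation="gelu", batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

z_L = encoder(z0)                      # (1, N + 1, D)

# Only the class token's final state is passed to the classification head,
# since it aggregates the global image context.
head = nn.Linear(D, num_classes)
logits = head(z_L[:, 0])               # (1, num_classes)
```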

2.1.4 Constructing Feature Vectors

$$ \mathbf{z}_0 = [\mathbf{x}_{\text{class}};\ \mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \cdots;\ \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos}, \qquad \mathbf{E}\in\mathbb{R}^{(P^2 \cdot C)\times D},\ \mathbf{E}_{pos}\in\mathbb{R}^{(N+1)\times D} $$

The equation above illustrates the generation of the feature vectors: each of the $N$ flattened patches $\mathbf{x}_p^i \in \mathbb{R}^{P^2 \cdot C}$ is projected to $D$ dimensions by the learnable matrix $\mathbf{E}$, the class embedding $\mathbf{x}_{\text{class}}$ is prepended to the sequence, and the positional embeddings $\mathbf{E}_{pos}$ are added to form the encoder input $\mathbf{z}_0$.
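
Translated directly into tensor operations (a sketch continuing the hypothetical shapes used above; `E`, `E_pos`, and `x_class` correspond to the symbols in the equation):

```python
import torch

N, P, C, D = 196, 16, 3, 768                          # patches, patch size, channels, width

x_p = torch.randn(N, P * P * C)                       # flattened patches x_p^i
E = torch.nn.Parameter(torch.randn(P * P * C, D))     # patch projection E, shape (P^2*C, D)
x_class = torch.nn.Parameter(torch.zeros(1, D))       # learnable class embedding
E_pos = torch.nn.Parameter(torch.randn(N + 1, D))     # positional embeddings, shape (N+1, D)

# z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos  ->  shape (N + 1, D)
z0 = torch.cat([x_class, x_p @ E], dim=0) + E_pos
```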