ViT (Vision Transformer) | Glossary | ScienceToStartup
ViT (Vision Transformer)
Gold definition · Updated Apr 2, 2026
The Vision Transformer (ViT) represents a paradigm shift in computer vision, adapting the highly successful transformer architecture from natural language processing to image recognition. Introduced by researchers at Google Brain in 2020, ViT processes images by dividing them into a grid of non-overlapping patches, which are then treated as a sequence of tokens, analogous to words in a sentence. Each patch is linearly embedded and combined with positional encodings before being fed into a standard transformer encoder. This approach bypasses the need for convolutional layers, demonstrating that transformers can achieve competitive or superior performance on image tasks, especially when pre-trained on large datasets. ViT's significance lies in challenging the long-standing dominance of Convolutional Neural Networks (CNNs) in computer vision, proving the versatility and scalability of the transformer architecture. It is widely used by researchers and major tech companies for computer vision applications ranging from image classification to more complex tasks like object detection and segmentation.
Core Architecture of ViT
Image Patching and Linear Embedding in ViT
ViT divides an input image into a grid of fixed-size, non-overlapping patches. Each patch is then flattened into a 1D vector and linearly projected into a higher-dimensional embedding space, forming a sequence of tokens analogous to words in NLP.
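The patching step can be sketched in a few lines of NumPy. This is an illustrative example, not the reference implementation; the image size (224×224), patch size (16×16), and embedding dimension (768) match the commonly cited ViT-Base configuration, and the projection matrix is random where a trained model would learn it.

```python
import numpy as np

rng = np.random.default_rng(0)

img_size, patch_size, channels, embed_dim = 224, 16, 3, 768
num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196

# A dummy input image (H, W, C); a real pipeline would load pixel data.
image = rng.standard_normal((img_size, img_size, channels))

# Split into non-overlapping 16x16 patches, then flatten each patch
# into a 1D vector of length 16 * 16 * 3 = 768.
grid = img_size // patch_size
patches = image.reshape(grid, patch_size, grid, patch_size, channels)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, -1)

# Linear projection into the embedding space (learnable in practice,
# random here for illustration).
W = rng.standard_normal((patch_size * patch_size * channels, embed_dim))
tokens = patches @ W  # shape: (196, 768)
```

Note that each 16×16×3 patch happens to flatten to a 768-dimensional vector here, the same size as the embedding; in general the two dimensions are independent, which is why the projection matrix W is needed.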
Positional Embeddings for ViT
To retain spatial information lost during the flattening process, learnable positional embeddings are added to the patch embeddings. This allows the transformer encoder to understand the relative positions of different image patches within the original image.
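Concretely, the positional information is just an elementwise addition of a per-position vector. A minimal sketch, assuming the 196-patch, 768-dimensional setup above; the positional table is randomly initialized here, whereas in a trained ViT it is a learned parameter updated by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 196, 768

# Patch embeddings from the previous step (random stand-in here).
patch_embeddings = rng.standard_normal((num_patches, embed_dim))

# One learnable vector per sequence position; small init is typical.
pos_embeddings = 0.02 * rng.standard_normal((num_patches, embed_dim))

# Adding the two lets the encoder distinguish otherwise identical
# patches that came from different locations in the image grid.
tokens = patch_embeddings + pos_embeddings  # shape: (196, 768)
```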
Transformer Encoder in ViT
The sequence of embedded patches, along with a special 'classification token', is fed into a standard transformer encoder. This encoder consists of multiple layers of multi-head self-attention and feed-forward networks, enabling the model to capture global dependencies between image patches.
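The encoder step can be illustrated with single-head scaled dot-product self-attention over the patch sequence plus a prepended classification token. This is a deliberate simplification: a real ViT layer uses multi-head attention, layer normalization, residual connections, and an MLP block, repeated across many layers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d = 196, 768

tokens = rng.standard_normal((num_patches, d))
cls_token = rng.standard_normal((1, d))  # learnable in a trained model

# Prepend the classification token: sequence length becomes 197.
x = np.concatenate([cls_token, tokens], axis=0)  # (197, 768)

# Query / key / value projections (random stand-ins for learned weights).
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Scaled dot-product attention: every token attends to every other,
# which is how ViT captures global dependencies between patches.
scores = q @ k.T / np.sqrt(d)  # (197, 197)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ v  # (197, 768)
```

After the final encoder layer, the output at the classification token's position (row 0 here) is fed to a small head that produces class logits.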
Performance and Training of ViT
Large-Scale Pre-training for ViT
ViT models achieve their best performance when pre-trained on massive datasets, such as JFT-300M or ImageNet-21k. This extensive pre-training allows the model to learn robust visual representations, compensating for the lack of inherent inductive biases found in CNNs.
Performance on Image Recognition with ViT
After pre-training, ViT can be fine-tuned on smaller downstream datasets like ImageNet, often surpassing the accuracy of state-of-the-art CNNs. Its ability to model long-range dependencies across an image contributes to its strong performance.
Inductive Biases in ViT vs. CNNs
Unlike CNNs, which have built-in inductive biases like locality and translation equivariance, ViT learns these properties from data. While this requires more data, it also makes ViT more flexible and potentially capable of learning more complex patterns.
Variants and Applications of ViT
Architectural Variants of ViT
Numerous ViT variants have emerged, such as Data-efficient Image Transformers (DeiT) which enable training on smaller datasets, and Swin Transformers which introduce hierarchical attention to improve efficiency and performance on dense prediction tasks.
Beyond Classification with ViT
ViT's principles have been extended beyond image classification to tasks like object detection, semantic segmentation, and even video understanding. Its adaptability makes it a foundational architecture for various computer vision challenges.
At a glance
Executive summary
The Vision Transformer (ViT) is a groundbreaking AI model that applies the transformer architecture, originally for text, to analyze images. It breaks images into small pieces and processes them like words, achieving top performance in image recognition without traditional image-specific neural networks. This shows that transformers are incredibly versatile for different types of data.
TL;DR
The Vision Transformer (ViT) uses the same AI tech that powers ChatGPT to understand images by treating parts of an image like words in a sentence.
Key points
Processes images by dividing them into patches and feeding them as sequences to a standard transformer encoder.
Solves the problem of applying powerful transformer architectures to computer vision, challenging CNN dominance.
Used by computer vision researchers, Google Brain, and major tech companies for image analysis and understanding.
Unlike CNNs, ViT lacks inherent inductive biases (like locality) and relies on large-scale pre-training to learn visual patterns.
A key trend in foundation models, extending transformer success from NLP to vision and multimodal learning.
Use cases
High-accuracy image classification in large-scale datasets like ImageNet.
Medical image analysis for disease detection and diagnosis, such as identifying tumors in X-rays.
Autonomous driving systems for object detection and scene understanding.
Content moderation and visual search engines for identifying and categorizing images.
Satellite imagery analysis for environmental monitoring and urban planning.