Vision Transformers (ViTs) are deep learning models that apply the transformer architecture to computer vision tasks, processing images as sequences of patches. They leverage self-attention to capture global contextual information, enabling robust representations for various visual understanding challenges.
Vision Transformers are advanced AI models that process images by attending to the whole picture at once through a mechanism called self-attention, rather than only small local regions. This lets them understand complex visual scenes and perform well on tasks such as recognizing objects, answering questions about images, and guiding robots.
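The two ideas in this definition — splitting an image into a sequence of patches, then letting every patch attend to every other patch — can be sketched in a few lines of NumPy. This is a minimal illustration, not a full ViT: the patch size, projection dimensions, and random weights are all arbitrary choices for the example.

```python
import numpy as np

def patchify(image, patch=4):
    # Split an HxWxC image into non-overlapping patch x patch tiles and
    # flatten each tile: the "sequence of patches" a ViT consumes.
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    tiles = image[:rows * patch, :cols * patch]
    tiles = tiles.reshape(rows, patch, cols, patch, C)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * C)

def self_attention(x, Wq, Wk, Wv):
    # Single-head self-attention: each patch token attends to all others,
    # which is how a ViT captures global context in one layer.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over all patches
    return weights @ v, weights

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8, 3))               # toy 8x8 RGB "image"
tokens = patchify(img, patch=4)                # 4 patch tokens, 48 values each
d = tokens.shape[1]
Wq, Wk, Wv = (rng.normal(size=(d, 16)) for _ in range(3))
out, attn = self_attention(tokens, Wq, Wk, Wv)
```

Here `attn` is a 4x4 matrix: each row shows how strongly one patch attends to every patch in the image, including distant ones — the "global contextual information" the definition refers to.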
Related terms: ViT (Vision Transformer), Swin-T (Swin Transformer), DINOv2 (often used as a teacher model), Waypoint Diffusion Transformers (WiT)