Vision Transformers (ViTs) are a class of neural network architectures that adapt the Transformer model, originally designed for natural language processing, to computer vision tasks. Instead of convolving over pixels, a ViT divides an image into fixed-size patches, embeds each patch, and processes the resulting sequence with self-attention. Because every patch can attend to every other patch, ViTs model global context and long-range relationships within an image more directly than traditional convolutional neural networks, and they achieve strong performance on standard vision benchmarks.
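The patch-splitting step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full ViT: the `patchify` helper and the projection matrix are illustrative stand-ins (in a real model the projection is learned, and a class token and position embeddings are added before the encoder). The 224x224 input and 16x16 patch size are a common configuration.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Carve the image into a grid of (patch_size x patch_size) tiles.
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each tile into one token vector of length patch_size*patch_size*C.
    return patches.reshape(-1, patch_size * patch_size * C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))          # a dummy RGB image
tokens = patchify(img, 16)               # 14*14 = 196 patch tokens
print(tokens.shape)                      # (196, 768)

# Stand-in for the learned linear projection to the model dimension.
embed = rng.normal(size=(768, 192)) / np.sqrt(768)
x = tokens @ embed                       # (196, 192) sequence for the encoder
print(x.shape)
```

The resulting sequence `x` is what the Transformer encoder's self-attention layers consume, which is why each patch can attend to every other patch regardless of spatial distance.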