Multi-head attention is a fundamental component of the Transformer architecture, designed to improve a model's ability to process sequential data by focusing on various parts of an input sequence simultaneously. Instead of a single attention function, it projects the input queries, keys, and values multiple times into different lower-dimensional subspaces. An attention function is then applied in parallel to each projected version, creating multiple 'attention heads.' The outputs from these heads are concatenated and linearly transformed back to the desired dimension. This parallel processing enables the model to learn different types of relationships—e.g., syntactic, semantic, or long-range dependencies—and integrate diverse contextual information. It significantly enhances the model's representational capacity and ability to handle complex sequences. Multi-head attention is widely used across various domains, including Natural Language Processing (NLP) in models like BERT and GPT, computer vision in Vision Transformers, and increasingly in generative models for tasks such as long-form video generation.
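The projection-split-attend-concatenate flow described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weight matrices `W_q`, `W_k`, `W_v`, `W_o` and the helper `multi_head_attention` are hypothetical names chosen for this example, and masking, batching, and dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project the input into queries, keys, and values, then split
    # each projection into `num_heads` lower-dimensional subspaces.
    q = (x @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, applied to every head in parallel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, seq, d_head)
    # Concatenate the heads and linearly project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy example: a sequence of 4 tokens with model dimension 8 and 2 heads.
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)
print(out.shape)  # same shape as the input: (4, 8)
```

Note that each head attends over the full sequence but within its own `d_head`-dimensional subspace, which is what allows different heads to specialize in different relationships.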
In short, multi-head attention is a core component of modern AI models, especially Transformers, that lets them attend to different parts of an input in parallel. By focusing on multiple relationships at once, it improves how models understand complex data, though its interactions with other model components can also be a source of failure modes in generative tasks.
MHA, Multi-Headed Attention