Multi-head attention is a fundamental component of the Transformer architecture, designed to improve a model's ability to process sequential data by focusing on various parts of an input sequence simultaneously. Instead of a single attention function, it projects the input queries, keys, and values multiple times into different lower-dimensional subspaces. An attention function is then applied in parallel to each projected version, creating multiple 'attention heads.' The outputs from these heads are concatenated and linearly transformed back to the desired dimension. This parallel processing enables the model to learn different types of relationships—e.g., syntactic, semantic, or long-range dependencies—and integrate diverse contextual information. It significantly enhances the model's representational capacity and ability to handle complex sequences. Multi-head attention is widely used across various domains, including Natural Language Processing (NLP) in models like BERT and GPT, computer vision in Vision Transformers, and increasingly in generative models for tasks such as long-form video generation.
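The projection-split-attend-concatenate flow described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weight matrices `W_q`, `W_k`, `W_v`, `W_o` and the helper `multi_head_attention` are hypothetical names chosen for this example, and masking, batching, and dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project the input into queries, keys, and values, then split
    # each projection into `num_heads` lower-dimensional subspaces.
    q = (x @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, applied to every head in parallel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, seq, d_head)
    # Concatenate the heads and linearly project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy example: a sequence of 4 tokens with model dimension 8 and 2 heads.
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)
print(out.shape)  # same shape as the input: (4, 8)
```

Note that each head attends over the full sequence but within its own `d_head`-dimensional subspace, which is what allows different heads to specialize in different relationships.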
In short, multi-head attention is a core component of modern AI models, especially Transformers, that lets them attend to different parts of an input in parallel. By focusing on multiple relationships at once, it improves how models understand complex data, though its interactions with other model components can also be a source of failure modes in generative tasks.
MHA, Multi-Headed Attention