Self-attention is a fundamental mechanism in neural networks, most prominently in the Transformer architecture, that lets a model weigh the significance of every element of an input sequence when processing each element. Unlike recurrent neural networks, which process sequences step by step, self-attention permits parallel computation and directly models dependencies between any two positions, regardless of their distance. It works by computing query, key, and value vectors for each input element, taking the scaled dot product of each query with all keys to obtain attention scores, and applying a softmax to turn those scores into weights. The weights are then applied to the value vectors, producing a weighted sum that serves as the contextualized representation of each element. This mechanism is crucial for capturing long-range dependencies and contextual relationships, addressing both the limited effective context of RNNs and the cost CNNs incur to aggregate global context. It is widely used across natural language processing (NLP) and computer vision, and increasingly in areas such as video understanding, brain disorder diagnosis, and spatio-temporal graph forecasting, as evidenced by recent research.
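The query/key/value computation described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the projection matrices `Wq`, `Wk`, `Wv` and the toy input are assumptions introduced here for demonstration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model).

    Wq, Wk, Wv project each input element to its query, key, and value vectors.
    """
    Q = X @ Wq                                # queries, one per position
    K = X @ Wk                                # keys
    V = X @ Wv                                # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) scaled dot products
    # softmax over the key dimension turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # weighted sum of values per position

# Toy example: 4 tokens, model dimension 8 (illustrative random weights)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per input element
```

Note how every output row mixes information from all four positions at once, which is what allows attention to relate distant elements without any recurrence.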
Self-attention is a key component in modern AI models, especially Transformers, allowing them to understand how different parts of an input relate to each other. It helps models focus on the most relevant information, improving performance in tasks like language processing, video analysis, and medical diagnosis.
Common variants and efficient approximations include Multi-head Self-Attention, Local Self-Attention, Sparse Self-Attention, Axial Attention, Linear Attention, Performer, and Reformer.