FlashAttention is a hardware-aware, exact attention algorithm that speeds up Transformer computation by minimizing data movement between GPU high-bandwidth memory (HBM) and fast on-chip SRAM. Instead of materializing the full attention score matrix, it computes attention in tiles that fit in on-chip memory, which turns the memory-bound standard implementation into a largely compute-bound one. Note that it does not reduce the quadratic arithmetic cost of attention; it removes the quadratic memory footprint and the I/O bottleneck, which is what dominates wall-clock time in practice. The result is substantially faster training and inference for large language models and makes long-context applications practical, all while producing the same output as standard attention.
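The tiling idea above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the real CUDA kernel: it processes keys and values one block at a time while maintaining a running row maximum and running softmax normalizer per query (the "online softmax" trick), so the full N×N score matrix never exists in memory at once. Function names and the block size are arbitrary choices for this sketch.

```python
import numpy as np

def attention_naive(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, block=4):
    """FlashAttention-style tiling (simplified, single head, no masking).

    K and V are consumed in blocks; per query row we keep a running
    score maximum m and a running softmax normalizer l, rescaling the
    partial output O whenever the maximum changes. Only a block-sized
    slice of the score matrix is ever held at once.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row max of scores seen so far
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)               # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)               # rescale old accumulators
        P = np.exp(S - m_new[:, None])          # block softmax numerator
        l = l * scale + P.sum(axis=-1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
# Tiled computation matches the naive reference exactly (up to float error).
assert np.allclose(attention_naive(Q, K, V), attention_tiled(Q, K, V))
```

The real kernel fuses these steps on-chip and recomputes the blocks during the backward pass instead of storing them, but the accumulator-rescaling logic shown here is the core of why the exact result can be produced without the full score matrix.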
Related terms: FlashAttention-2, FlashAttention-3, FlashAttention-4