Gated Linear Units (GLUs) are neural network layers that compute the element-wise product of two linear projections of the input, with one projection passed through a sigmoid so it acts as a gate. The gate lets the layer decide, per element, how much of the other projection's signal passes through. This controlled information flow has made GLU-style feed-forward blocks popular in large models, including Mixture-of-Experts architectures and forecasting systems on large datasets, where they help improve efficiency and scalability while maintaining accuracy.
Related variants: SwiGLU, GeGLU, ReGLU, GLU-variant FFNs
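A minimal NumPy sketch of the mechanism described above: GLU computes σ(xV) ⊙ (xW), and variants such as SwiGLU swap the sigmoid for another activation (here SiLU/Swish). Biases are omitted and the weight names `W` and `V` are illustrative, not from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x * sigmoid(x)

def glu(x, W, V):
    # GLU: one linear projection (xW) carries content; the other,
    # squashed to (0, 1) by the sigmoid, gates it element-wise.
    return sigmoid(x @ V) * (x @ W)

def swiglu(x, W, V):
    # SwiGLU variant: the sigmoid gate is replaced by SiLU (Swish).
    return silu(x @ V) * (x @ W)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of 4 inputs, model dim 8
W = rng.standard_normal((8, 16))   # content projection (illustrative shapes)
V = rng.standard_normal((8, 16))   # gate projection

print(glu(x, W, V).shape)     # each output is (batch, hidden) = (4, 16)
print(swiglu(x, W, V).shape)
```

Because the sigmoid gate lies in (0, 1), each GLU output element is never larger in magnitude than the corresponding element of the ungated content path `x @ W`, which is the sense in which the gate "controls" information flow.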