Group reward-Decoupled Normalization Policy Optimization (GDPO) is a policy optimization method for multi-reward reinforcement learning. Existing approaches typically combine multiple reward signals into a single scalar before normalizing it, which collapses the signals into one less informative value and often destabilizes training. GDPO instead decouples the normalization, normalizing each reward signal separately so that the relative differences along every reward dimension are preserved. This yields more accurate optimization and substantially more stable training, especially when aligning language models with diverse human preferences.
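The contrast is easiest to see numerically. The minimal sketch below (illustrative only; the function names, weighting scheme, and group-relative baseline are assumptions, not the paper's exact formulation) compares a combine-then-normalize advantage, in the style of group-relative methods, against a decoupled variant that normalizes each reward dimension across the group before aggregating:

```python
import numpy as np

def combined_advantages(rewards, eps=1e-8):
    """Combine-then-normalize baseline: sum reward components per
    response, then normalize the single combined value across the
    group. Signals merge before normalization, so a large-scale
    reward can dominate the smaller ones."""
    combined = rewards.sum(axis=1)                       # (G,)
    return (combined - combined.mean()) / (combined.std() + eps)

def decoupled_advantages(rewards, weights=None, eps=1e-8):
    """Decoupled normalization (GDPO-style sketch): normalize each
    reward component separately across the group, then aggregate.
    Each signal keeps its own relative differences and scale."""
    mean = rewards.mean(axis=0, keepdims=True)           # per-reward mean
    std = rewards.std(axis=0, keepdims=True)             # per-reward std
    normalized = (rewards - mean) / (std + eps)          # (G, K)
    if weights is None:
        weights = np.ones(rewards.shape[1])              # equal weighting (assumed)
    return normalized @ weights                          # (G,)

# Toy group: 4 sampled responses, 2 reward signals on very
# different scales (e.g. a 0-10 correctness score and a 0-1
# safety score).
rewards = np.array([[8.0, 0.2],
                    [2.0, 0.9],
                    [5.0, 0.6],
                    [9.0, 0.1]])
print(combined_advantages(rewards))   # ranking driven almost entirely by signal 1
print(decoupled_advantages(rewards))  # both signals contribute on a comparable scale
```

In the combined version, the response with the best safety score receives the lowest advantage because the larger-scale reward swamps it; with decoupled normalization, both signals influence the advantage on an equal footing.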