Group reward-Decoupled Normalization Policy Optimization (GDPO) is a policy optimization method for multi-reward reinforcement learning. Existing approaches typically combine multiple reward signals into a single scalar before normalizing it, which collapses the signals into one less informative value and often destabilizes training. GDPO instead decouples the normalization, normalizing each reward signal separately so that the relative differences along every reward dimension are preserved. This yields more accurate optimization and substantially more stable training, especially when aligning language models with diverse human preferences.
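The contrast is easiest to see numerically. The minimal sketch below (illustrative only; the function names, weighting scheme, and group-relative baseline are assumptions, not the paper's exact formulation) compares a combine-then-normalize advantage, in the style of group-relative methods, against a decoupled variant that normalizes each reward dimension across the group before aggregating:

```python
import numpy as np

def combined_advantages(rewards, eps=1e-8):
    """Combine-then-normalize baseline: sum reward components per
    response, then normalize the single combined value across the
    group. Signals merge before normalization, so a large-scale
    reward can dominate the smaller ones."""
    combined = rewards.sum(axis=1)                       # (G,)
    return (combined - combined.mean()) / (combined.std() + eps)

def decoupled_advantages(rewards, weights=None, eps=1e-8):
    """Decoupled normalization (GDPO-style sketch): normalize each
    reward component separately across the group, then aggregate.
    Each signal keeps its own relative differences and scale."""
    mean = rewards.mean(axis=0, keepdims=True)           # per-reward mean
    std = rewards.std(axis=0, keepdims=True)             # per-reward std
    normalized = (rewards - mean) / (std + eps)          # (G, K)
    if weights is None:
        weights = np.ones(rewards.shape[1])              # equal weighting (assumed)
    return normalized @ weights                          # (G,)

# Toy group: 4 sampled responses, 2 reward signals on very
# different scales (e.g. a 0-10 correctness score and a 0-1
# safety score).
rewards = np.array([[8.0, 0.2],
                    [2.0, 0.9],
                    [5.0, 0.6],
                    [9.0, 0.1]])
print(combined_advantages(rewards))   # ranking driven almost entirely by signal 1
print(decoupled_advantages(rewards))  # both signals contribute on a comparable scale
```

In the combined version, the response with the best safety score receives the lowest advantage because the larger-scale reward swamps it; with decoupled normalization, both signals influence the advantage on an equal footing.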