5 papers · avg viability 6.8
Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm for training large language models (LLMs) on reasoning tasks. For each prompt it samples a group of rollouts and scores each one relative to the group's mean reward, avoiding a separately learned value critic. The standard pipeline is statically uniform: prompts are sampled uniformly and every prompt receives a fixed number of rollouts, which is sample-inefficient when the data distribution is heterogeneous and heavy-tailed.
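GRPO's core step, computing each rollout's advantage relative to its group rather than via a learned critic, can be sketched as follows (the function name and the `1e-8` stabilizer are illustrative, not from any specific implementation):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's scalar reward
    by the mean and std of its group (no value network required)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # small epsilon avoids div-by-zero

# One prompt, a group of 4 sampled completions with binary rewards:
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Rollouts above the group mean get positive advantage, below get negative.
```

These advantages then weight a PPO-style clipped policy-gradient objective; the "fixed number of rollouts" the text criticizes is the group size used here.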