Group Relative Policy Optimization (GRPO) is a group-based reinforcement learning method used primarily for post-training large language models. For each prompt, the policy samples a group of candidate outputs, scores them with a reward signal, and computes each output's advantage relative to the group's average, which removes the need for a separately trained value function (critic). This makes GRPO well suited to complex tasks such as multi-step reasoning, fine-grained multi-objective control (for example, generating molecules with desired properties), multi-modal alignment, and efficient model collaboration, where a smaller model selectively defers to a larger one.
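The core mechanism is compact enough to sketch. The snippet below is a minimal illustration, not any library's API (the function names are hypothetical): it normalizes each completion's reward against the mean and standard deviation of its own group to get advantages, then plugs those advantages into the PPO-style clipped surrogate objective that GRPO maximizes.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage estimate: each sampled completion is scored
    relative to its own group, so no learned critic is needed.

    rewards: shape (num_prompts, group_size) -- one row per prompt,
    one column per sampled completion.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def grpo_surrogate(logp_new, logp_old, adv, clip=0.2):
    """PPO-style clipped surrogate averaged over the batch.
    Training maximizes this value (i.e., minimizes its negative)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    return np.minimum(ratio * adv, clipped * adv).mean()

# Toy example: 2 prompts, 4 sampled completions each.
rewards = [
    [0.1, 0.9, 0.4, 0.6],  # prompt 1: completion 2 is best
    [0.0, 0.0, 1.0, 0.0],  # prompt 2: only completion 3 is correct
]
adv = group_relative_advantages(rewards)
print(adv.round(2))
# Completions scoring above their group's mean get positive
# advantages (reinforced); below-mean ones get negative advantages.
```

Because advantages are centered within each group, at least one completion per prompt is pushed up and at least one is pushed down, which is what lets GRPO learn from relative quality alone without an absolute value baseline.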