Agentic Turn-based Policy Optimization (ATPO) is a turn-level learning objective for multi-turn agentic Reinforcement Learning. It aligns policy updates with the natural decision granularity of agentic interactions, correcting the misalignment between update boundaries and decision boundaries in LLM agents.
Agentic Turn-based Policy Optimization (ATPO) is a method to improve how AI agents, especially those using large language models, learn to complete tasks that involve many steps. It works by making sure the AI's learning adjustments happen at each distinct decision point, or "turn," in a task, which helps the AI learn more effectively from its actions.
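The idea of assigning learning signal at each turn, rather than only to the whole episode, can be illustrated with a minimal sketch. The function below is a generic, hypothetical example of turn-level credit assignment (a discounted return-to-go per turn, minus a mean baseline); it is not ATPO's actual estimator, and the names and discount value are assumptions for illustration.

```python
def turn_level_advantages(turn_rewards, gamma=0.9):
    """Assign a per-turn advantage: discounted return-to-go for each
    turn, centered by the mean so turns are compared against each other.
    Illustrative only; ATPO's exact objective may differ."""
    returns = []
    g = 0.0
    for r in reversed(turn_rewards):  # accumulate from the final turn backward
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

# A 3-turn episode where only the final turn yields task reward:
# earlier turns still receive (smaller) credit via the return-to-go.
advantages = turn_level_advantages([0.0, 0.0, 1.0])
print(advantages)
```

Each turn then gets its own policy-gradient weight, so a decisive late turn is not averaged away into a single episode-level score.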
Also known as: ATPO, Turn-level Policy Optimization, Agentic Policy Optimization