EAPO (Evidence-Augmented Policy Optimization) is an RL algorithm designed to improve LLM reasoning in long-context scenarios by addressing sparse outcome rewards. It introduces dense process supervision through a Group-Relative Evidence Reward and an Adaptive Reward-Policy Co-Evolution mechanism to enhance evidence quality.
EAPO is a new AI method that helps large language models (LLMs) reason better when dealing with very long texts. It does this by giving the LLM more specific feedback on *how* it finds information, rather than just whether its final answer is right, which helps it avoid making lucky guesses and improves its ability to use evidence.
Evidence-Augmented Policy Optimization
Was this definition helpful?