Recent advances in reinforcement learning increasingly focus on adaptability and efficiency across diverse applications. Hierarchical reinforcement learning frameworks are being developed to leverage accumulated skills for improved reasoning in complex tasks, while systems that exploit next-state signals enable agents to learn continuously from interactions without extensive retraining. Meta-reinforcement learning techniques allow agents to refine their search strategies based on past experience, strengthening exploration. Innovations in automatic environment generation are also streamlining the creation of high-performance RL environments, significantly reducing engineering burden. These developments are particularly relevant for commercial applications such as personal assistants and robotics, where ongoing learning and adaptability are crucial. The field is shifting toward more integrated and scalable solutions that address the limitations of traditional methods and pave the way for robust, real-world deployment.
The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation wi...
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a liv...
Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generi...
We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that o...
This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode ...
Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning \emph{when} and \emph{who} to communicate with requires choosing among many possibl...
Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on group mean, GRPO treats each output as an independent ...
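The group-mean advantage computation mentioned above can be sketched in a few lines. This is a minimal, illustrative implementation of group-relative advantage normalization in the style of GRPO, not the paper's exact formulation; the function name and the `eps` stabilizer are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """For a group of outputs sampled from the same prompt, score each
    output's reward relative to the group: subtract the group mean and
    divide by the group standard deviation (eps avoids division by zero)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled completions for one prompt, scored 1/0 by a verifier.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because every output in the group is normalized against the same baseline, correct completions receive positive advantages and incorrect ones negative, without training a separate value network.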
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable ...
Cross-domain reinforcement learning (CDRL) aims to improve the data efficiency of RL by leveraging data samples collected from a source domain to facilitate learning in a similar target do...
Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. Howeve...