RAGEN (Reasoning AGENt, pronounced like "region") leverages reinforcement learning (RL) to train
LLM reasoning agents in interactive, stochastic environments.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.
2025.4.20 Update:
Our RAGEN paper is out!
We've further streamlined the RAGEN codebase (v0423) to improve development.
- Architecture: Restructured veRL as a submodule for better co-development
- Modularity: Divided RAGEN into three components—Environment Manager, Context Manager, and Agent Proxy, making it significantly simpler to add new environments (details below), track environmental dynamics, and run multiple experiments
2025.4.16 Update:
We recently noticed that a third-party website has been created using our project's name and content. While we appreciate the interest in the project, we’d like to clarify that this GitHub repository is the official and primary source for all code, updates, and documentation. If we launch an official website in the future, it will be explicitly linked here.
Thank you for your support and understanding!
2025.3.13 Update:
We have recently been refactoring the RAGEN codebase to help you develop your own ideas on top of it. Please check out our development branch. The first version decouples RAGEN and veRL for better co-development, taking the latter as a submodule rather than a static directory.
2025.3.8 Update:
- The previous veRL implementation had an issue with the KL term, which has been fixed in recent versions.
- We find evidence from multiple sources (Open-Reasoner-Zero, TinyZero, and Zhihu) that PPO can be more stable than GRPO training. We have changed the default advantage estimator to GAE (using PPO) and aim to find more stable yet efficient RL optimization methods in later versions.
Reinforcement Learning (RL) with rule-based rewards has shown promise in enhancing reasoning capabilities of large language models (LLMs). However, existing approaches have primarily focused on static, single-turn tasks like math reasoning and coding. Extending these methods to agent scenarios introduces two fundamental challenges:
- Multi-turn Interactions: Agents must perform sequential decision-making and react to environment feedback
- Stochastic Environments: Uncertainty where identical actions can lead to different outcomes
To address these challenges, we propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general RL framework that provides a unified approach for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures. Building on StarPO, we introduce RAGEN, a modular agent training and evaluation system that implements the complete training loop, including rollout generation, reward calculation, and trajectory optimization. RAGEN serves as robust research infrastructure for systematically analyzing LLM agent training dynamics in multi-turn, stochastic environments.
RAGEN introduces a reinforcement learning framework to train reasoning-capable LLM agents that can operate in interactive, stochastic environments.
The StarPO (State-Thinking-Actions-Reward Policy Optimization) framework consists of two interleaved stages: a rollout stage and an update stage. The LLM iteratively generates reasoning-guided actions to interact with the environment, and the resulting trajectory-level rewards are used to update the LLM, jointly optimizing reasoning and action strategies.
The framework consists of two key components:
We formulate agent-environment interactions as Markov Decision Processes (MDPs) in which states and actions are token sequences, allowing LLMs to reason over environment dynamics. At time t, the agent observes state s_t, emits action a_t, and the environment returns a reward and transitions to the next state.
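Written out with standard MDP notation (a sketch of the setup rather than the paper's exact definitions), a rollout is the trajectory

$$
\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T), \qquad s_{t+1} \sim P(\cdot \mid s_t, a_t), \qquad R(\tau) = \sum_{t=0}^{T-1} r_t,
$$

where each state $s_t$ and action $a_t$ is a token sequence, $P$ is the (possibly stochastic) environment transition function, and the trajectory-level return $R(\tau)$ serves as the training signal.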
StarPO is a general RL framework for optimizing entire multi-turn interaction trajectories for LLM agents. The algorithm alternates between two phases:
Given an initial state, the LLM generates multiple trajectories. At each step, the model receives the trajectory history and generates a reasoning-guided action: <think>...</think><ans> action </ans>. The environment receives the action and returns feedback (a reward and the next state).
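For instance, a response in this format can be split into its reasoning and action parts with a simple regular expression (an illustrative sketch; the actual parsing is handled by the Context Manager described below):

```python
import re

def parse_response(text: str):
    """Split a '<think>...</think><ans>...</ans>' response into reasoning and action."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    ans = re.search(r"<ans>(.*?)</ans>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    action = ans.group(1).strip() if ans else ""  # an empty action can be treated as invalid
    return reasoning, action

reasoning, action = parse_response("<think>The box is to my right.</think><ans>Right</ans>")
```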
After generating trajectories, we train LLMs to optimize expected rewards. Instead of step-by-step optimization, StarPO optimizes entire trajectories using importance sampling. This approach enables long-horizon reasoning while maintaining computational efficiency. StarPO supports multiple optimization strategies:
- PPO: We estimate token-level advantages using a value function over trajectories
- GRPO: We assign a normalized reward to the full trajectory
Rollout and update stages interleave in StarPO, enabling both online and offline learning.
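As a rough sketch (a standard PPO-style surrogate given for intuition, not necessarily the exact objective used in the paper), the trajectory-level update with importance sampling over the generated tokens $y_i$ of a trajectory $\tau$ can be written as

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\mathrm{old}}}\!\left[ \frac{1}{|\tau|} \sum_{i} \min\!\left( \rho_i(\theta)\, \hat{A}_i,\ \mathrm{clip}\!\left(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_i \right) \right], \qquad \rho_i(\theta) = \frac{\pi_\theta(y_i \mid \tau_{<i})}{\pi_{\theta_\mathrm{old}}(y_i \mid \tau_{<i})}.
$$

Under PPO, $\hat{A}_i$ comes from a learned value function (e.g., GAE); under GRPO, $\hat{A}_i$ is the group-normalized trajectory reward, $(R(\tau) - \mathrm{mean}) / \mathrm{std}$ over the group of trajectories sharing the same initial state, broadcast to all tokens of the trajectory.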
For detailed setup instructions, please check our documentation. Here's a quick start guide:
# Setup environment and download data (2.7MB)
bash scripts/setup_ragen.sh
If this fails, you can follow the manual setup instructions in scripts/setup_ragen.md.
Here's how to train models with RAGEN:
We provide a default configuration in config/base.yaml. This file includes symbolic links to:
- config/ppo_trainer.yaml
- config/envs.yaml
The base configuration automatically inherits all contents from these two config files, creating a unified configuration system.
To train:
python train.py --config-name base
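If you want to inspect the fully merged configuration before launching a run, Hydra's Compose API can print it (a minimal sketch assuming the config/ layout above; the --config-name flag suggests the project uses Hydra, and the snippet is assumed to be saved and run from the repository root):

```python
# Print the merged configuration (base.yaml + ppo_trainer.yaml + envs.yaml)
# without starting a training run.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="config"):  # on Hydra >= 1.2, pass version_base=None to silence the version warning
    cfg = compose(config_name="base")
    print(OmegaConf.to_yaml(cfg))
```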
Check val/generations in wandb to view the model's generations during validation.
We evaluate RAGEN across multiple environments. Below are results for Qwen-2.5-0.5B-Instruct on Sokoban, FrozenLake, and Bandit.
- No KL loss or KL penalty was applied during training
- We selectively retained only the top 25% of trajectories that successfully completed their respective tasks
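The success-based filtering above is conceptually simple; the idea can be sketched as follows (an illustration using trajectory reward as a proxy for success, not the codebase's actual implementation):

```python
def filter_trajectories(trajectories, keep_ratio=0.25):
    """Keep the top fraction of rollouts, ranked by trajectory reward (illustrative)."""
    ranked = sorted(trajectories, key=lambda traj: traj["reward"], reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]
```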
We demonstrate RAGEN's robust generalization ability by training on simple Sokoban environments (6×6 with 1 box) and successfully evaluating performance on:
- Larger Sokoban environments (8×8 with 2 boxes)
- Simple Sokoban with alternative grid vocabulary representations
- FrozenLake environments
Key observations:
- Removing the KL term and filtering out failed trajectories leads to better and more stable performance
- Generalization results highlight RAGEN's capacity to transfer learned policies across varying environment complexities, representations, and domains.
We implement RAGEN as a modular system with three main modules: the Environment State Manager (ragen/llm_agent/es_manager.py), the Context Manager (ragen/llm_agent/ctx_manager.py), and the Agent Proxy (ragen/llm_agent/agent_proxy.py).
- Environment State Manager (es_manager):
- Supports multiple environment setups (different environments, the same environment with different seeds, or the same environment with the same seed)
- Records the state of each environment during rollout
- Processes actions from the ctx_manager, executes environment steps, and returns the results (observations) to the ctx_manager in a batch-wise manner
- Context Manager (ctx_manager):
- Parses raw agent tokens into structured actions for the es_manager
- Formats observations from the es_manager and prepares them as inputs for the agent's next rollout step
- Gathers the final rollout trajectories and compiles them into tokens, attention masks, reward scores, and loss masks for the LLM update
- Agent Proxy (agent_proxy): Serves as the interface for executing single or multi-round rollouts
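Conceptually, the three modules are wired together in a rollout loop like the following (method names here are illustrative, not the exact APIs in the codebase):

```python
# Conceptual rollout loop tying together es_manager, ctx_manager, and the LLM.
# Method names are hypothetical and only sketch the data flow.
def rollout(es_manager, ctx_manager, llm, max_turns: int):
    env_outputs = es_manager.reset()                         # initial observations, one per environment
    for _ in range(max_turns):
        prompts = ctx_manager.format_prompts(env_outputs)    # observations -> LLM inputs
        responses = llm.generate(prompts)                    # reasoning-guided <think>/<ans> responses
        actions = ctx_manager.parse_actions(responses)       # LLM outputs -> structured env actions
        env_outputs = es_manager.step(actions)               # batched environment step
    # tokens, attention masks, rewards, and loss masks for the LLM update
    return ctx_manager.compile_trajectories(env_outputs)
```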
To add a new environment to our framework:
- Implement an OpenAI Gym-compatible environment in ragen/env/new_env/env.py with these required methods:
  - step(action): Process the action and return the next state
  - reset(seed): Initialize the environment with a new seed
  - render(): Return the current state observation
  - close(): Clean up resources
- Define the environment configuration in ragen/env/new_env/config.py
- Register your environment in config/envs.yaml:

      custom_envs:
        NewEnvironment:                # Tag
          env_type: new_env            # Must match the environment class name
          max_actions_per_traj: 50     # Example value
          env_instruction: "Your environment instructions here"
          env_config: {}               # Configuration options from config.py
- Add the environment tag to the es_manager section in config/base.yaml
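For reference, a minimal skeleton of such an environment file might look like the following (illustrative only; a real environment should follow the interfaces and base classes used by the existing environments under ragen/env/):

```python
# ragen/env/new_env/env.py -- minimal illustrative skeleton of the four required methods.
import random

class NewEnvironment:
    def __init__(self, config=None):
        self.config = config
        self.rng = random.Random()
        self.steps = 0

    def reset(self, seed=None):
        # Initialize the environment with a new seed and return the first observation.
        self.rng.seed(seed)
        self.steps = 0
        return self.render()

    def step(self, action):
        # Process the action and return (observation, reward, done, info).
        self.steps += 1
        reward = 1.0 if action == "goal" else 0.0
        done = reward > 0 or self.steps >= 50
        return self.render(), reward, done, {}

    def render(self):
        # Return a text observation of the current state.
        return f"step {self.steps}"

    def close(self):
        # Clean up resources (nothing to release in this toy example).
        pass
```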
RAGEN provides an easy way to evaluate a model:
python -m ragen.llm_agent.agent_proxy --config-name <eval_config>
You only need to set the model and the environments to evaluate in config/<eval_config>.yaml.
We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.
- VAGEN: Training Visual Agents with multi-turn reinforcement learning
- Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning
- Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
- OpenManus-RL: A live-stream development of RL tuning for LLM agents
- MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
*:Equal Contribution.
We thank the DeepSeek team for providing the DeepSeek-R1 model and early conceptual inspirations. We are grateful to the veRL team for their infrastructure support. We thank the TinyZero team for their discoveries that informed our initial exploration. We also appreciate insightful discussions with Han Liu, Xinyu Xing, Li Erran Li, John Schulman, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Matthew Khoriaty, Haoxiang Sun, Jiajun Liu.
@misc{RAGEN,
author = {Zihan Wang* and Kangrui Wang* and Qineng Wang* and Pingyue Zhang* and Linjie Li* and Zhengyuan Yang and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Monica Lam and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
title = {Training Agents by Reinforcing Reasoning},
year = {2025},
organization = {GitHub},
url = {https://github.com/ZihanWang314/ragen},
}