RAGEN (Reasoning AGENt, pronounced like "region") leverages reinforcement learning (RL) to train
LLM reasoning agents in interactive, stochastic environments.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.
2025.4.20 Update:
Our RAGEN paper is out!
We've further streamlined the RAGEN codebase (v0423) to improve development.
- Architecture: Restructured veRL as a submodule for better co-development
- Modularity: Divided RAGEN into three components—Environment Manager, Context Manager, and Agent Proxy, making it significantly simpler to add new environments (details below), track environmental dynamics, and run multiple experiments
2025.4.16 Update:
We recently noticed that a third-party website has been created using our project's name and content. While we appreciate the interest in the project, we’d like to clarify that this GitHub repository is the official and primary source for all code, updates, and documentation. If we launch an official website in the future, it will be explicitly linked here.
Thank you for your support and understanding!
2025.3.13 Update:
We have recently been refactoring the RAGEN codebase to help you develop your own ideas on top of it. Please check out our development branch. The first version decouples RAGEN and veRL for better co-development, taking the latter as a submodule rather than a static directory.
2025.3.8 Update:
- The previous veRL implementation had an issue with the KL term, which has been fixed in recent versions.
- We find evidence from multiple sources (Open-Reasoner-Zero, TinyZero, and Zhihu) that PPO can be more stable than GRPO training. We have changed the default advantage estimator to GAE (using PPO) and aim to find more stable yet efficient RL optimization methods in later versions.
Reinforcement Learning (RL) with rule-based rewards has shown promise in enhancing reasoning capabilities of large language models (LLMs). However, existing approaches have primarily focused on static, single-turn tasks like math reasoning and coding. Extending these methods to agent scenarios introduces two fundamental challenges:
- Multi-turn Interactions: Agents must perform sequential decision-making and react to environment feedback
- Stochastic Environments: Uncertainty where identical actions can lead to different outcomes
To address these challenges, we propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general RL framework that provides a unified approach for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures. Building on StarPO, we introduce RAGEN, a modular agent training and evaluation system that implements the complete training loop, including rollout generation, reward calculation, and trajectory optimization. RAGEN serves as robust research infrastructure for systematically analyzing LLM agent training dynamics in multi-turn, stochastic environments.
RAGEN introduces a reinforcement learning framework to train reasoning-capable LLM agents that can operate in interactive, stochastic environments.
The StarPO (State-Thinking-Actions-Reward Policy Optimization) framework consists of two interleaved stages: a rollout stage and an update stage. The LLM iteratively generates reasoning-guided actions to interact with the environment, and the resulting trajectory-level rewards are used to update the LLM, jointly optimizing reasoning and action strategies.
The framework consists of two key components:
We formulate agent-environment interactions as Markov Decision Processes (MDPs) in which states and actions are token sequences, allowing LLMs to reason over environment dynamics. At time t, the agent observes state s_t, emits action a_t, and the environment returns a reward and transitions to the next state.
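Written out with standard MDP notation (a sketch of the setup rather than the paper's exact definitions), a rollout is the trajectory

$$
\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T), \qquad s_{t+1} \sim P(\cdot \mid s_t, a_t), \qquad R(\tau) = \sum_{t=0}^{T-1} r_t,
$$

where each state $s_t$ and action $a_t$ is a token sequence, $P$ is the (possibly stochastic) environment transition function, and the trajectory-level return $R(\tau)$ serves as the training signal.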
StarPO is a general RL framework for optimizing entire multi-turn interaction trajectories for LLM agents. The algorithm alternates between two phases:
Given an initial state, the LLM generates multiple trajectories. At each step, the model receives the trajectory history and generates a reasoning-guided action: <think>...</think><ans> action </ans>. The environment receives the action and returns feedback (a reward and the next state).
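For instance, a response in this format can be split into its reasoning and action parts with a simple regular expression (an illustrative sketch; the actual parsing is handled by the Context Manager described below):

```python
import re

def parse_response(text: str):
    """Split a '<think>...</think><ans>...</ans>' response into reasoning and action."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    ans = re.search(r"<ans>(.*?)</ans>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    action = ans.group(1).strip() if ans else ""  # an empty action can be treated as invalid
    return reasoning, action

reasoning, action = parse_response("<think>The box is to my right.</think><ans>Right</ans>")
```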
After generating trajectories, we train LLMs to optimize expected rewards. Instead of step-by-step optimization, StarPO optimizes entire trajectories using importance sampling. This approach enables long-horizon reasoning while maintaining computational efficiency. StarPO supports multiple optimization strategies:
- PPO: We estimate token-level advantages using a value function over trajectories
- GRPO: We assign a normalized reward to the full trajectory
Rollout and update stages interleave in StarPO, enabling both online and offline learning.
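As a rough sketch (a standard PPO-style surrogate given for intuition, not necessarily the exact objective used in the paper), the trajectory-level update with importance sampling over the generated tokens $y_i$ of a trajectory $\tau$ can be written as

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\mathrm{old}}}\!\left[ \frac{1}{|\tau|} \sum_{i} \min\!\left( \rho_i(\theta)\, \hat{A}_i,\ \mathrm{clip}\!\left(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_i \right) \right], \qquad \rho_i(\theta) = \frac{\pi_\theta(y_i \mid \tau_{<i})}{\pi_{\theta_\mathrm{old}}(y_i \mid \tau_{<i})}.
$$

Under PPO, $\hat{A}_i$ comes from a learned value function (e.g., GAE); under GRPO, $\hat{A}_i$ is the group-normalized trajectory reward, $(R(\tau) - \mathrm{mean}) / \mathrm{std}$ over the group of trajectories sharing the same initial state, broadcast to all tokens of the trajectory.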
For detailed setup instructions, please check our documentation. Here's a quick start guide:
# Setup environment and download data (2.7MB)
bash scripts/setup_ragen.sh
If this fails, you can follow the manual setup instructions in scripts/setup_ragen.md.
Here's how to train models with RAGEN:
We provide a default configuration in config/base.yaml. This file includes symbolic links to:
- config/ppo_trainer.yaml
- config/envs.yaml
The base configuration automatically inherits all contents from these two config files, creating a unified configuration system.
To train:
python train.py --config-name base
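If you want to inspect the fully merged configuration before launching a run, Hydra's Compose API can print it (a minimal sketch assuming the config/ layout above; the --config-name flag suggests the project uses Hydra, and the snippet is assumed to be saved and run from the repository root):

```python
# Print the merged configuration (base.yaml + ppo_trainer.yaml + envs.yaml)
# without starting a training run.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="config"):  # on Hydra >= 1.2, pass version_base=None to silence the version warning
    cfg = compose(config_name="base")
    print(OmegaConf.to_yaml(cfg))
```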
Check val/generations in wandb to view the model's generations during validation.
We evaluate RAGEN across multiple environments. Below are results for Qwen-2.5-0.5B-Instruct on Sokoban, FrozenLake, and Bandit.
- No KL loss or KL penalty was applied during training
- We selectively retained only the top 25% of trajectories that successfully completed their respective tasks
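The success-based filtering above is conceptually simple; the idea can be sketched as follows (an illustration using trajectory reward as a proxy for success, not the codebase's actual implementation):

```python
def filter_trajectories(trajectories, keep_ratio=0.25):
    """Keep the top fraction of rollouts, ranked by trajectory reward (illustrative)."""
    ranked = sorted(trajectories, key=lambda traj: traj["reward"], reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]
```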
We demonstrate RAGEN's robust generalization ability by training on simple Sokoban environments (6×6 with 1 box) and successfully evaluating performance on:
- Larger Sokoban environments (8×8 with 2 boxes)
- Simple Sokoban with alternative grid vocabulary representations
- FrozenLake environments
Key observations:
- Removing the KL term and filtering out failed trajectories leads to better and more stable performance
- Generalization results highlight RAGEN's capacity to transfer learned policies across varying environment complexities, representations, and domains.
We implement RAGEN as a modular system with three main modules: the Environment State Manager (ragen/llm_agent/es_manager.py), the Context Manager (ragen/llm_agent/ctx_manager.py), and the Agent Proxy (ragen/llm_agent/agent_proxy.py).
- Environment State Manager (es_manager):
- Supports multiple environment setups (different environments, the same environment with different seeds, or the same environment with the same seed)
- Records the state of each environment during rollout
- Processes actions from the ctx_manager, executes environment steps, and returns the results (observations) to the ctx_manager in a batch-wise manner
- Context Manager (ctx_manager):
- Parses raw agent tokens into structured actions for the es_manager
- Formats observations from the es_manager and prepares them as inputs for the agent's next rollout step
- Gathers the final rollout trajectories and compiles them into tokens, attention masks, reward scores, and loss masks for the LLM update
- Agent Proxy (agent_proxy): Serves as the interface for executing single or multi-round rollouts
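Conceptually, the three modules are wired together in a rollout loop like the following (method names here are illustrative, not the exact APIs in the codebase):

```python
# Conceptual rollout loop tying together es_manager, ctx_manager, and the LLM.
# Method names are hypothetical and only sketch the data flow.
def rollout(es_manager, ctx_manager, llm, max_turns: int):
    env_outputs = es_manager.reset()                         # initial observations, one per environment
    for _ in range(max_turns):
        prompts = ctx_manager.format_prompts(env_outputs)    # observations -> LLM inputs
        responses = llm.generate(prompts)                    # reasoning-guided <think>/<ans> responses
        actions = ctx_manager.parse_actions(responses)       # LLM outputs -> structured env actions
        env_outputs = es_manager.step(actions)               # batched environment step
    # tokens, attention masks, rewards, and loss masks for the LLM update
    return ctx_manager.compile_trajectories(env_outputs)
```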
To add a new environment to our framework:
- Implement an OpenAI Gym-compatible environment in ragen/env/new_env/env.py with these required methods:
  - step(action): Process the action and return the next state
  - reset(seed): Initialize the environment with a new seed
  - render(): Return the current state observation
  - close(): Clean up resources
- Define the environment configuration in ragen/env/new_env/config.py
- Register your environment in config/envs.yaml:

      custom_envs:
        NewEnvironment:                # Tag
          env_type: new_env            # Must match the environment class name
          max_actions_per_traj: 50     # Example value
          env_instruction: "Your environment instructions here"
          env_config: {}               # Configuration options from config.py
- Add the environment tag to the es_manager section in config/base.yaml
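For reference, a minimal skeleton of such an environment file might look like the following (illustrative only; a real environment should follow the interfaces and base classes used by the existing environments under ragen/env/):

```python
# ragen/env/new_env/env.py -- minimal illustrative skeleton of the four required methods.
import random

class NewEnvironment:
    def __init__(self, config=None):
        self.config = config
        self.rng = random.Random()
        self.steps = 0

    def reset(self, seed=None):
        # Initialize the environment with a new seed and return the first observation.
        self.rng.seed(seed)
        self.steps = 0
        return self.render()

    def step(self, action):
        # Process the action and return (observation, reward, done, info).
        self.steps += 1
        reward = 1.0 if action == "goal" else 0.0
        done = reward > 0 or self.steps >= 50
        return self.render(), reward, done, {}

    def render(self):
        # Return a text observation of the current state.
        return f"step {self.steps}"

    def close(self):
        # Clean up resources (nothing to release in this toy example).
        pass
```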
RAGEN provides an easy way to evaluate a model:
python -m ragen.llm_agent.agent_proxy --config-name <eval_config>
You only need to set the model and the environments to evaluate in config/<eval_config>.yaml.
We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.
- VAGEN: Training Visual Agents with multi-turn reinforcement learning
- Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning
- Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
- OpenManus-RL: A live-stream development of RL tuning for LLM agents
- MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
*:Equal Contribution.
We thank the DeepSeek team for providing the DeepSeek-R1 model and early conceptual inspirations. We are grateful to the veRL team for their infrastructure support. We thank the TinyZero team for their discoveries that informed our initial exploration. We also appreciate insightful discussions with Han Liu, Xinyu Xing, Li Erran Li, John Schulman, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Matthew Khoriaty, Haoxiang Sun, Jiajun Liu.
@misc{RAGEN,
author = {Zihan Wang* and Kangrui Wang* and Qineng Wang* and Pingyue Zhang* and Linjie Li* and Zhengyuan Yang and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Monica Lam and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
title = {Training Agents by Reinforcing Reasoning},
year = {2025},
organization = {GitHub},
url = {https://github.com/ZihanWang314/ragen},
}