
RAGEN: Training Agents by Reinforcing Reasoning


RAGEN (Reasoning AGENt, pronounced like "region") leverages reinforcement learning (RL) to train
LLM reasoning agents in interactive, stochastic environments.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.

Documentation | Blog | Paper | Post | Experiment Log

2025.4.20 Update:

Our RAGEN paper is out!

We've further streamlined the RAGEN codebase (v0423) to improve development.

  1. Architecture: Restructured veRL as a submodule for better co-development
  2. Modularity: Divided RAGEN into three components—Environment Manager, Context Manager, and Agent Proxy, making it significantly simpler to add new environments (details below), track environmental dynamics, and run multiple experiments

2025.4.16 Update:

We recently noticed that a third-party website has been created using our project's name and content. While we appreciate the interest in the project, we’d like to clarify that this GitHub repository is the official and primary source for all code, updates, and documentation. If we launch an official website in the future, it will be explicitly linked here.

Thank you for your support and understanding!

2025.3.13 Update:

We are currently refactoring the RAGEN code to help you better develop your own ideas on the codebase. Please check out our developing branch. The first version decomposes RAGEN and veRL for better co-development, taking the latter as a submodule rather than a static directory.

2025.3.8 Update:

  1. The previous veRL implementation had an issue with the KL term, which has been fixed in recent versions.
  2. We found evidence from multiple sources (Open-Reasoner-Zero, TinyZero, and Zhihu) that PPO can be more stable than GRPO training. We have therefore changed the default advantage estimator to GAE (i.e., PPO) and aim to find more stable yet efficient RL optimization methods in later versions.

Overview

Reinforcement Learning (RL) with rule-based rewards has shown promise in enhancing reasoning capabilities of large language models (LLMs). However, existing approaches have primarily focused on static, single-turn tasks like math reasoning and coding. Extending these methods to agent scenarios introduces two fundamental challenges:

  1. Multi-turn Interactions: Agents must perform sequential decision-making and react to environment feedback
  2. Stochastic Environments: Uncertainty where identical actions can lead to different outcomes

To address these challenges, we propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general RL framework that provides a unified approach for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures. Building upon StarPO, we introduce RAGEN, a modular agent training and evaluation system that implements the complete training loop, including rollout generation, reward calculation, and trajectory optimization. RAGEN serves as a robust research infrastructure for systematically analyzing LLM agent training dynamics in multi-turn, stochastic environments.

Algorithm

RAGEN introduces a reinforcement learning framework to train reasoning-capable LLM agents that can operate in interactive, stochastic environments.

StarPO Framework

The StarPO (State-Thinking-Action-Reward Policy Optimization) framework interleaves two stages: a rollout stage and an update stage. The LLM iteratively generates reasoning-guided actions to interact with the environment, and the resulting trajectory-level rewards are used to update the LLM, jointly optimizing its reasoning and action strategies.

The framework consists of two key components:

> MDP Formulation

We formulate agent-environment interactions as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics. At time t, state $s_t$ transitions to the next state through action $a_t$ following a transition function. The policy generates actions given the trajectory history. The objective is to maximize expected cumulative rewards across multiple interaction turns.
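
In standard notation (a sketch of the objective, not the exact formulation from the paper), with trajectories $\tau = (s_0, a_0, r_0, \dots, s_T)$ sampled from the policy $\pi_\theta$, the objective is

$$\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} r(s_t, a_t) \right]$$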

> StarPO: Reinforcing Reasoning via Trajectory-Level Optimization

StarPO is a general RL framework for optimizing entire multi-turn interaction trajectories for LLM agents. The algorithm alternates between two phases:

Rollout Stage: Reasoning-Interaction Trajectories

Given an initial state, the LLM generates multiple trajectories. At each step, the model receives the trajectory history and generates a reasoning-guided action: <think>...</think><ans> action </ans>. The environment receives the action and returns feedback (reward and next state).
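
As an illustration, here is a minimal sketch of how a response in this format could be split into its reasoning and action parts. The helper below is hypothetical and is not RAGEN's actual parser (that logic lives in the Context Manager, described later).

import re

def parse_action(response: str) -> dict:
    """Split a reasoning-guided response of the form
    <think>...</think><ans> action </ans> into its parts.
    Hypothetical helper, not the actual RAGEN parser."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    ans = re.search(r"<ans>(.*?)</ans>", response, re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else "",
        "action": ans.group(1).strip() if ans else "",
    }

print(parse_action("<think>The box is left of the target, so I push it right.</think><ans>Right</ans>"))
# {'reasoning': 'The box is left of the target, so I push it right.', 'action': 'Right'}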

Update Stage: Multi-turn Trajectory Optimization

After generating trajectories, we train LLMs to optimize expected rewards. Instead of step-by-step optimization, StarPO optimizes entire trajectories using importance sampling. This approach enables long-horizon reasoning while maintaining computational efficiency. StarPO supports multiple optimization strategies:

  • PPO: We estimate token-level advantages using a value function over trajectories
  • GRPO: We assign normalized reward to the full trajectory

Rollout and update stages interleave in StarPO, enabling both online and offline learning.
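
To make the GRPO variant concrete, here is a minimal sketch of group-normalized trajectory advantages, assuming several trajectories are sampled from the same initial state. The function name and reward format are illustrative; the actual estimators live in the veRL trainer.

import numpy as np

def grpo_trajectory_advantages(trajectory_rewards):
    """Normalize each trajectory's total reward against the group of
    trajectories sampled from the same initial state (GRPO-style).
    Every token in a trajectory then shares that trajectory's advantage."""
    rewards = np.asarray(trajectory_rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(grpo_trajectory_advantages([1.0, 0.0, 0.0, 1.0]))  # approximately [ 1. -1. -1.  1.]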

Environment Setup

For detailed setup instructions, please check our documentation. Here's a quick start guide:

# Setup environment and download data (2.7MB)
bash scripts/setup_ragen.sh

If this fails, you can follow the manual setup instructions in scripts/setup_ragen.md.

Training Models

Here's how to train models with RAGEN:

Export variables and train

We provide a default configuration in config/base.yaml. This file includes symbolic links to:

  • config/ppo_trainer.yaml
  • config/envs.yaml

The base configuration automatically inherits all contents from these two config files, creating a unified configuration system.

To train:

python train.py --config-name base

Visualization

Check val/generations in wandb to inspect the model's rollouts logged during validation.

Performance

We evaluate RAGEN across multiple environments. Below are results for Qwen-2.5-0.5B-Instruct on Sokoban, FrozenLake, and Bandit.

  • No KL loss or KL penalty was applied during training
  • We selectively retained only the top 25% of trajectories that successfully completed their respective tasks (a sketch of this filtering idea follows)
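
As a rough illustration of that filtering step (not the code used in our experiments; the field names are hypothetical):

def keep_top_trajectories(trajectories, keep_ratio=0.25):
    """Keep only the top fraction of successful rollouts, ranked by reward.
    Illustrative sketch; 'reward' and 'success' are hypothetical fields."""
    successful = [t for t in trajectories if t["success"]]
    successful.sort(key=lambda t: t["reward"], reverse=True)
    return successful[: max(1, int(len(successful) * keep_ratio))]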

[Figures: training curves on Bandit, Simple Sokoban, and FrozenLake]

We demonstrate RAGEN's robust generalization ability by training on simple Sokoban environments (6×6 with 1 box) and successfully evaluating performance on:

  • Larger Sokoban environments (8×8 with 2 boxes)
  • Simple Sokoban with alternative grid vocabulary representations
  • FrozenLake environments

[Figures: generalization results on Larger Sokoban, Sokoban with a different grid vocabulary, and FrozenLake]

Key observations:

  • Dropping the KL term and filtering out failed trajectories yields better and more stable performance
  • Generalization results highlight RAGEN's capacity to transfer learned policies across varying environment complexities, representations, and domains.

Modular System Design of RAGEN

We implement RAGEN as a modular system with three main modules: the Environment State Manager (ragen/llm_agent/es_manager.py), the Context Manager (ragen/llm_agent/ctx_manager.py), and the Agent Proxy (ragen/llm_agent/agent_proxy.py). A schematic sketch of how they interact follows the list below.

  • Environment State Manager (es_manager):
    • Supports multiple environments in a batch (different environments, the same environment with different seeds, or the same environment with the same seed)
    • Records states of each environment during rollout
    • Processes actions from ctx_manager, executes step, and returns action results (observations) to ctx_manager in a batch-wise manner
  • Context Manager (ctx_manager):
    • Parses raw agent tokens into structured actions for the es_manager
    • Formats observations from the es_manager into prompts for the agent's next rollout turn
    • Gathers the final rollout trajectories and compiles them into tokens, attention masks, reward scores, and loss masks for the LLM update
  • Agent Proxy (agent_proxy): Serves as the interface for executing single or multi-round rollouts
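
Here is a schematic sketch of how these modules might be wired together during a rollout; the method names are illustrative, not the actual RAGEN interfaces.

def rollout(agent_proxy, ctx_manager, es_manager, max_turns=10):
    """Schematic multi-turn rollout loop; all method names are illustrative."""
    observations = es_manager.reset()                   # batch of initial environment states
    for _ in range(max_turns):
        prompts = ctx_manager.format(observations)      # build prompts from trajectory history
        responses = agent_proxy.generate(prompts)       # LLM emits <think>...</think><ans>...</ans>
        actions = ctx_manager.parse(responses)          # structured actions for the es_manager
        observations, dones = es_manager.step(actions)  # execute actions batch-wise
        if all(dones):
            break
    return ctx_manager.compile_trajectories()           # tokens, masks, rewards for the LLM update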

Adding Custom Environments

To add a new environment to our framework:

  1. Implement an OpenAI Gym-compatible environment in ragen/env/new_env/env.py with these required methods (a minimal sketch follows this list):

    • step(action): Process actions and return next state
    • reset(seed): Initialize environment with new seed
    • render(): Return current state observation
    • close(): Clean up resources
  2. Define environment configuration in ragen/env/new_env/config.py

  3. Register your environment in config/envs.yaml:

    custom_envs:
      NewEnvironment:  # Tag
        env_type: new_env  # Must match environment class name
        max_actions_per_traj: 50  # Example value
        env_instruction: "Your environment instructions here"
        env_config: {}  # Configuration options from config.py
  4. Add the environment tag to the es_manager section in config/base.yaml
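
Here is a minimal sketch of such an environment. The class name, fields, and Gym-style return convention of step() are illustrative assumptions, not copied from the RAGEN codebase.

import random

class NewEnvironment:
    """Toy Gym-compatible environment exposing the four required methods.
    Illustrative sketch only; names and return conventions are assumptions."""

    def __init__(self, env_config=None):
        self.config = env_config or {}
        self.steps = 0
        self.grid = None

    def reset(self, seed=None):
        # Initialize the environment with a new seed and return the first observation
        self.rng = random.Random(seed)
        self.steps = 0
        self.grid = [["."] * 4 for _ in range(4)]
        return self.render()

    def step(self, action):
        # Process the action and return (observation, reward, done, info), Gym-style
        self.steps += 1
        reward = 1.0 if action == "goal" else 0.0
        done = reward > 0 or self.steps >= self.config.get("max_actions_per_traj", 50)
        return self.render(), reward, done, {}

    def render(self):
        # Return a text observation of the current state
        return "\n".join(" ".join(row) for row in self.grid)

    def close(self):
        # Clean up resources (nothing needed in this toy example)
        pass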

Evaluation

RAGEN provides an easy way to evaluate a model:

python -m ragen.llm_agent.agent_proxy --config-name <eval_config>

You only need to set the model and environment to evaluate in config/<eval_config>.yaml.

Feedback

We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.

Awesome work powered or inspired by RAGEN

  • VAGEN: Training Visual Agents with multi-turn reinforcement learning
  • Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning
  • Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
  • OpenManus-RL: A live-streamed development of RL tuning for LLM agents
  • MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

Contributors

Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li

*: Equal contribution.

Acknowledgements

We thank the DeepSeek team for providing the DeepSeek-R1 model and early conceptual inspirations. We are grateful to the veRL team for their infrastructure support. We thank the TinyZero team for their discoveries that informed our initial exploration. We would like to appreciate insightful discussions with Han Liu, Xinyu Xing, Li Erran Li, John Schulman, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Matthew Khoriaty, Haoxiang Sun, Jiajun Liu.

Citation

@misc{RAGEN,
  author       = {Zihan Wang* and Kangrui Wang* and Qineng Wang* and Pingyue Zhang* and Linjie Li* and Zhengyuan Yang and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Monica Lam and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  title        = {Training Agents by Reinforcing Reasoning},
  year         = {2025},
  organization = {GitHub},
  url          = {https://github.com/ZihanWang314/ragen},
}
