Code for Feedback Generation, Reward Learning, and RL Benchmarks
This repository contains code for training and evaluating reinforcement learning agents using various types of feedback.
Repository structure:
- train_baselines/: Main training scripts (a fork of Stable Baselines3 Zoo, not by the authors of this repository)
- multi_type_feedback/: Scripts for reward model training and agent training with learned rewards
- setup.sh: Setup script for the environment
- dependencies/stable-baselines3/: A slightly modified version of Stable Baselines3 (fixed for compatibility with gymnasium==1.0.0a2), not by the authors of this repository
- dependencies/masksembles/: Masksembles implementation, not by the authors of this repository
1. Agent Training (train_baselines/train.py)
Trains PPO agents on various environments:
python train_baselines/train.py --algo ppo --env <environment> --verbose 0 --save-freq <frequency> --seed <seed> --gym-packages procgen ale_py --log-folder train_baselines/gt_agents
Environments: Ant-v5, Swimmer-v5, HalfCheetah-v5, Hopper-v5, Atari, Procgen, ...
Info: Please make sure to use train_baselines/gt_agents as the log folder to ensure compatibility with the feedback-generation script; you can adapt the expert model directories if necessary.
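For example, a single training run with illustrative values (a MuJoCo environment, an arbitrary checkpoint frequency, and seed 0) could look like:

python train_baselines/train.py --algo ppo --env HalfCheetah-v5 --verbose 0 --save-freq 100000 --seed 0 --gym-packages procgen ale_py --log-folder train_baselines/gt_agents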
2. Feedback Generation (multi_type_feedback/generate_feedback.py)
Generates feedback for trained agents:
python multi_type_feedback/generate_feedback.py --algorithm ppo --environment <env> --seed <seed> --n-feedback 10000 --save-folder feedback
Note: The script looks for trained agents in the gt_agents folder and expects that train_baselines/benchmark_envs.py has been run to generate the evaluation scores.
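For example, generating 10,000 feedback instances for a HalfCheetah-v5 agent trained with seed 0 (environment and seed are placeholders):

python multi_type_feedback/generate_feedback.py --algorithm ppo --environment HalfCheetah-v5 --seed 0 --n-feedback 10000 --save-folder feedback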
3. Reward Model Training (multi_type_feedback/train_reward_model.py)
Trains reward models based on generated feedback:
python multi_type_feedback/train_reward_model.py --algorithm ppo --environment <env> --feedback-type <type> --seed <seed> --feedback-folder feedback --save-folder reward_models
Feedback types: evaluative, comparative, demonstrative, corrective, descriptive, descriptive_preference
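As an illustration, a reward model trained on comparative feedback for HalfCheetah-v5 (environment, feedback type, and seed chosen arbitrarily here):

python multi_type_feedback/train_reward_model.py --algorithm ppo --environment HalfCheetah-v5 --feedback-type comparative --seed 0 --feedback-folder feedback --save-folder reward_models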
4. Agent Training with Learned Reward (multi_type_feedback/train_RL_agent.py)
Trains agents using the learned reward models:
python multi_type_feedback/train_RL_agent.py --algorithm ppo --environment <env> --feedback-type <type> --seed <seed>
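For instance, to train an agent against the reward model learned from evaluative feedback (again with an illustrative environment and seed):

python multi_type_feedback/train_RL_agent.py --algorithm ppo --environment HalfCheetah-v5 --feedback-type evaluative --seed 0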
5. Agent Training with Learned Reward Function Ensemble (multi_type_feedback/train_agent_ensemble.py)
Trains agents using an ensemble of learned reward models:
python multi_type_feedback/train_RL_agent_with_ensemble.py --algorithm ppo --environment <env> --feedback-types <types> --seed <seed>
Feedback types: evaluative, comparative, demonstrative, corrective, descriptive, descriptive_preference
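For example, an ensemble run over evaluative and comparative feedback (this assumes --feedback-types takes a space-separated list; check the script's argument parser for the exact format):

python multi_type_feedback/train_RL_agent_with_ensemble.py --algorithm ppo --environment HalfCheetah-v5 --feedback-types evaluative comparative --seed 0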
Workflow:
- Install the package using pip install -e .
- Run initial training (e.g., with train_baselines/start_training.sh)
- Generate feedback
- Train reward models
- Train agents with learned rewards
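A minimal end-to-end sketch of this workflow for a single environment and seed (all values are illustrative, the default folders from the commands above are assumed, and any arguments that benchmark_envs.py may require are omitted):

ENV=HalfCheetah-v5
SEED=0

# 1. Train a ground-truth PPO expert
python train_baselines/train.py --algo ppo --env $ENV --verbose 0 --save-freq 100000 --seed $SEED --gym-packages procgen ale_py --log-folder train_baselines/gt_agents

# Benchmark the trained agents to produce the evaluation scores needed for feedback generation
python train_baselines/benchmark_envs.py

# 2. Generate feedback from the trained agent
python multi_type_feedback/generate_feedback.py --algorithm ppo --environment $ENV --seed $SEED --n-feedback 10000 --save-folder feedback

# 3. Train a reward model and 4. an agent for each feedback type
for TYPE in evaluative comparative demonstrative corrective descriptive descriptive_preference; do
    python multi_type_feedback/train_reward_model.py --algorithm ppo --environment $ENV --feedback-type $TYPE --seed $SEED --feedback-folder feedback --save-folder reward_models
    python multi_type_feedback/train_RL_agent.py --algorithm ppo --environment $ENV --feedback-type $TYPE --seed $SEED
done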
For detailed parameters and options, refer to the individual script files.
Additional scripts:
- train_baselines/benchmark_envs.py: Benchmark trained agents on various environments
- multi_type_feedback/Analyze_Generated_Feedback.ipynb: Jupyter notebook for analyzing generated feedback
- multi_type_feedback/Analyze_Reward_Model_Predictions.ipynb: Jupyter notebook for analyzing reward models
- multi_type_feedback/Generate_RL_result_curves.ipynb: Jupyter notebook for generating RL result curves
- and more...
Supported environments:
- MuJoCo
- Procgen
- Atari
- potentially other Gym environments
- This repository uses CUDA for GPU acceleration. Ensure proper CUDA setup before running.
- The training scripts are designed to distribute jobs across multiple GPUs.
- For large-scale experiments, consider using a job scheduler like Slurm (example scripts provided in the original bash files).
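As a rough illustration only (the repository's own bash files contain the actual example scripts), a Slurm batch file for a single reward-model training job might look like this; job name, resource requests, and environment activation are placeholders for your cluster:

#!/bin/bash
#SBATCH --job-name=reward-model            # placeholder job name
#SBATCH --gres=gpu:1                       # one GPU, since training uses CUDA
#SBATCH --cpus-per-task=8                  # placeholder CPU request
#SBATCH --time=24:00:00                    # placeholder wall-clock limit

# Activate your Python environment first, e.g.:
# source .venv/bin/activate

python multi_type_feedback/train_reward_model.py --algorithm ppo --environment HalfCheetah-v5 --feedback-type comparative --seed 0 --feedback-folder feedback --save-folder reward_models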