
Evaluate GRPO vs. other RL algorithms #11

Open · 2 tasks
gerred opened this issue Jan 25, 2025 · 8 comments
Comments

@gerred

gerred commented Jan 25, 2025

Per https://x.com/jiayi_pirate/status/1882839504899420517: even if GRPO is implemented first, would this be a good repo in which to add evals and work on other RL algorithms? Based on where R1 landed (and not having had the time to come to a conclusion myself), it seems like interesting follow-on work to pursue after replication.

  • PPO
  • PRIME
@lewtun
Member

lewtun commented Jan 25, 2025

Hi @gerred, yes, we would be very happy to have contributions comparing GRPO to PPO and friends!

Doing so will require a few changes on the TRL side for the other RL trainers:

  • Adding custom reward functions as done here and here
  • Speeding up rollouts with vLLM as done here

For evals we are using lighteval, and for tasks that are not natively supported in the lib we are adding them here.
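
For context, a reward function in this setup is just a callable that scores each completion. Below is a minimal sketch of a format-style reward of the kind open-r1 uses with GRPO; the signature, keyword handling, and tag names are assumptions for illustration rather than the exact TRL API, and the PPO/PRIME trainers would need an equivalent hook:

```python
import re

# Minimal sketch of a GRPO-style reward function: the trainer hands the
# callable a batch of completions and expects one scalar reward per
# completion. Signature and kwargs handling are assumptions, not the
# literal TRL API.
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    """Reward 1.0 when a completion follows the <think>/<answer> format."""
    rewards = []
    for completion in completions:
        # Completions may arrive as raw strings or as chat-style message lists.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if THINK_ANSWER.search(text) else 0.0)
    return rewards
```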

@gerred
Author

gerred commented Jan 25, 2025

Thanks @lewtun! I was looking through lighteval; I've played with TRL, but this is exactly what I wanted to dig into. Mind if I use this as a tracking issue for subsequent ones and keep this one open?

wip ppo reward funcs: https://github.com/gerred/open-r1
wip ppo trl: todo gerred add link
wip ppo lighteval/vllm: todo gerred add link

Will add more for PRIME; I also got a good suggestion to look at how Kimi is operating with its long-CoT RL.

@qgallouedec
Member

Sounds good @gerred!

@gerred
Author

gerred commented Jan 25, 2025

@qgallouedec @lewtun getting started with PPO first this morning. I'm imagining I'll use the same reward functions and system prompt for both PPO and PRIME, adjusting the signatures as needed. From the lighteval perspective, would it be better to refactor out a base class for the reward funcs and the prompt themselves?

nm, answered it for myself while doing PPO!
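
One possible shape for that sharing, as an illustrative sketch only (all names hypothetical, and not necessarily what gerred settled on): a plain module that both the GRPO and PPO/PRIME scripts import, rather than a base class.

```python
# Illustrative sketch only: share the system prompt and reward functions
# across GRPO/PPO/PRIME training scripts via a plain importable module.
# All names here are hypothetical.
SYSTEM_PROMPT = (
    "Reason step by step inside <think> tags, then give the final answer "
    "inside <answer> tags."
)

def accuracy_reward(completions, solution, **kwargs):
    """Reward 1.0 when the reference solution string appears in the completion."""
    return [1.0 if sol in comp else 0.0 for comp, sol in zip(completions, solution)]

# Each trainer script imports SYSTEM_PROMPT and REWARD_FUNCS, adapting the
# call signature to whatever its trainer expects.
REWARD_FUNCS = [accuracy_reward]
```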

@mariagrandury
Contributor

Hi! I'm familiar with lighteval and would be happy to help with evaluation. @gerred, let me know if there's something I can support you with.

@gerred
Author

gerred commented Jan 25, 2025

@mariagrandury Yes please! I am working through the top level and in TRL, getting to a base run. I'm taking some AFK time after fighting a local NCCL issue I found, so I'll be back in a few hours to spin up some instances, and I will get the branch up for open-r1. For PPO I'm currently weighting the two verifier funcs evenly, but I am taking a very naive approach.
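
As a rough sketch of what weighting the two verifier funcs evenly could look like for PPO (hypothetical names, reusing the illustrative format_reward and accuracy_reward functions from earlier in the thread):

```python
# Hypothetical sketch: equal weighting of two verifier-style reward functions
# for a PPO run, reusing the illustrative format_reward / accuracy_reward above.
def combined_reward(completions, solution, **kwargs):
    fmt = format_reward(completions, **kwargs)
    acc = accuracy_reward(completions, solution, **kwargs)
    return [0.5 * f + 0.5 * a for f, a in zip(fmt, acc)]
```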

@mariagrandury
Contributor

@gerred sounds good! Ping me, and maybe also have a look at #55.

@qsunyuan

Any updates on other RL methods (e.g., PPO) based on the Open-R1 repo?
