
Evaluate GRPO vs. other RL algorithms #11

Open · 2 tasks
gerred opened this issue Jan 25, 2025 · 8 comments
Comments

@gerred

gerred commented Jan 25, 2025

Per https://x.com/jiayi_pirate/status/1882839504899420517: even if GRPO is implemented first, would this be a good repo in which to add evals and work on other RL algorithms? Based on where R1 landed (and not having had the time to come to a conclusion myself), it seems like interesting follow-on work to pursue after replication.

  • PPO
  • PRIME
@lewtun
Member

lewtun commented Jan 25, 2025

Hi @gerred, yes, we would be very happy to have contributions comparing GRPO to PPO and friends!

Doing so will require a few changes on the TRL side for the other RL trainers:

  • Adding custom reward functions as done here and here
  • Speeding up rollouts with vLLM as done here

For evals we are using lighteval, and for tasks that are not natively supported in the lib we are adding them here.
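
For context, a reward function in this setup is just a callable that scores each completion. Below is a minimal sketch of a format-style reward of the kind open-r1 uses with GRPO; the signature, keyword handling, and tag names are assumptions for illustration rather than the exact TRL API, and the PPO/PRIME trainers would need an equivalent hook:

```python
import re

# Minimal sketch of a GRPO-style reward function: the trainer hands the
# callable a batch of completions and expects one scalar reward per
# completion. Signature and kwargs handling are assumptions, not the
# literal TRL API.
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    """Reward 1.0 when a completion follows the <think>/<answer> format."""
    rewards = []
    for completion in completions:
        # Completions may arrive as raw strings or as chat-style message lists.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if THINK_ANSWER.search(text) else 0.0)
    return rewards
```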

@gerred
Author

gerred commented Jan 25, 2025

Thanks @lewtun! I was looking through lighteval; I've played with TRL, but this is exactly what I wanted to dig into. Mind if I use this as a tracking issue for subsequent ones and keep this one open?

wip ppo reward funcs: https://github.com/gerred/open-r1
wip ppo trl: todo gerred add link
wip ppo lighteval/vllm: todo gerred add link

Will add more for PRIME; I also got a good suggestion to look at how Kimi is operating with its long-CoT RL.

@qgallouedec
Member

Sounds good @gerred!

@gerred
Author

gerred commented Jan 25, 2025

@qgallouedec @lewtun getting started with PPO first this morning. I'm imagining I'll use the same reward functions and system prompt for both PPO and PRIME, adjusting the signatures as needed. From the lighteval perspective, would it be better to refactor out a base class for the reward funcs and the prompt themselves?

nm, answered it for myself while doing PPO!
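
One possible shape for that sharing, as an illustrative sketch only (all names hypothetical, and not necessarily what gerred settled on): a plain module that both the GRPO and PPO/PRIME scripts import, rather than a base class.

```python
# Illustrative sketch only: share the system prompt and reward functions
# across GRPO/PPO/PRIME training scripts via a plain importable module.
# All names here are hypothetical.
SYSTEM_PROMPT = (
    "Reason step by step inside <think> tags, then give the final answer "
    "inside <answer> tags."
)

def accuracy_reward(completions, solution, **kwargs):
    """Reward 1.0 when the reference solution string appears in the completion."""
    return [1.0 if sol in comp else 0.0 for comp, sol in zip(completions, solution)]

# Each trainer script imports SYSTEM_PROMPT and REWARD_FUNCS, adapting the
# call signature to whatever its trainer expects.
REWARD_FUNCS = [accuracy_reward]
```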

@mariagrandury
Contributor

Hi! I'm familiar with lighteval and would be happy to help with evaluation. @gerred, let me know if there's something I can support you with.

@gerred
Author

gerred commented Jan 25, 2025

@mariagrandury Yes please! I am working through the top level and in TRL, getting to a base run. I'm taking some AFK time after fighting a local NCCL issue I found, so I'll be back in a few hours to spin up some instances, and I will get the branch up for open-r1. For PPO I'm currently weighting the two verifier funcs evenly, but I am taking a very naive approach.
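
As a rough sketch of what weighting the two verifier funcs evenly could look like for PPO (hypothetical names, reusing the illustrative format_reward and accuracy_reward functions from earlier in the thread):

```python
# Hypothetical sketch: equal weighting of two verifier-style reward functions
# for a PPO run, reusing the illustrative format_reward / accuracy_reward above.
def combined_reward(completions, solution, **kwargs):
    fmt = format_reward(completions, **kwargs)
    acc = accuracy_reward(completions, solution, **kwargs)
    return [0.5 * f + 0.5 * a for f, a in zip(fmt, acc)]
```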

@mariagrandury
Contributor

@gerred sounds good! Ping me, and maybe also have a look at #55.

@qsunyuan

Any updates on other RL methods (e.g., PPO) based on the Open-R1 repo?
