Merge branch 'main' into python-3.13

qgallouedec authored Feb 18, 2025
2 parents e99013b + 15fec31 commit 9057793

Showing 96 changed files with 3,819 additions and 850 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests_latest.yml
@@ -17,7 +17,7 @@ jobs:
steps:
- name: Git checkout
uses: actions/checkout@v4
- with: { ref: v0.13-release }
+ with: { ref: v0.15-release }
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -31,4 +31,4 @@ keywords:
- pytorch
- transformers
license: Apache-2.0
- version: 0.13
+ version: 0.15
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -23,7 +23,7 @@ There are several ways you can contribute to TRL:
* Contribute to the examples or the documentation.

If you don't know where to start, there is a special [Good First
- Issue](https://github.com/huggingface/trl/contribute) listing. It will give you a list of
+ Issue](https://github.com/huggingface/trl/labels/%F0%9F%91%B6%20good%20first%20issue) listing. It will give you a list of
open issues that are beginner-friendly and help you start contributing to open-source. The best way to do that is to open a Pull Request and link it to the issue that you'd like to work on. We try to give priority to opened PRs as we can easily track the progress of the fix, and if the contributor does not have time anymore, someone else can take the PR over.

For something slightly more challenging, you can also take a look at the [Good Second Issue](https://github.com/huggingface/trl/labels/Good%20Second%20Issue) list. In general though, if you feel like you know what you're doing, go for it and we'll help you get there! 🚀
39 changes: 13 additions & 26 deletions README.md
@@ -137,39 +137,26 @@ trainer = RewardTrainer(
trainer.train()
```

- ### `RLOOTrainer`
+ ### `GRPOTrainer`

- `RLOOTrainer` implements a [REINFORCE-style optimization](https://huggingface.co/papers/2402.14740) for RLHF that is more performant and memory-efficient than PPO. Here is a basic example of how to use the `RLOOTrainer`:
+ `GRPOTrainer` implements the [Group Relative Policy Optimization (GRPO) algorithm](https://huggingface.co/papers/2402.03300) that is more memory-efficient than PPO and was used to train [Deepseek AI's R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).

  ```python
- from trl import RLOOConfig, RLOOTrainer, apply_chat_template
  from datasets import load_dataset
- from transformers import (
-     AutoModelForCausalLM,
-     AutoModelForSequenceClassification,
-     AutoTokenizer,
- )
+ from trl import GRPOConfig, GRPOTrainer

- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
- reward_model = AutoModelForSequenceClassification.from_pretrained(
-     "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
- )
- ref_policy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
- policy = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
+ dataset = load_dataset("trl-lib/tldr", split="train")

- dataset = load_dataset("trl-lib/ultrafeedback-prompt")
- dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
- dataset = dataset.map(lambda x: tokenizer(x["prompt"]), remove_columns="prompt")
+ # Dummy reward function: rewards completions that are close to 20 characters
+ def reward_len(completions, **kwargs):
+     return [-abs(20 - len(completion)) for completion in completions]

- training_args = RLOOConfig(output_dir="Qwen2.5-0.5B-RL")
- trainer = RLOOTrainer(
-     config=training_args,
-     processing_class=tokenizer,
-     policy=policy,
-     ref_policy=ref_policy,
-     reward_model=reward_model,
-     train_dataset=dataset["train"],
-     eval_dataset=dataset["test"],
+ training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
+ trainer = GRPOTrainer(
+     model="Qwen/Qwen2-0.5B-Instruct",
+     reward_funcs=reward_len,
+     args=training_args,
+     train_dataset=dataset,
  )
  trainer.train()
  ```
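
A quick sanity check of the dummy reward in the new example (a standalone sketch; the behaviour follows directly from the `reward_len` function shown above, which takes a list of completions and returns one score per completion):

```python
# Reproduces the dummy reward from the new README example, outside the trainer.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

# "short" has 5 characters -> -15; "this is twenty chars" has exactly 20 -> 0
print(reward_len(["short", "this is twenty chars"]))  # [-15, 0]
```
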
2 changes: 1 addition & 1 deletion commands/run_sft.sh
@@ -42,7 +42,7 @@ accelerate launch $EXTRA_ACCELERATE_ARGS \
--output_dir $OUTPUT_DIR \
--max_steps $MAX_STEPS \
--per_device_train_batch_size $BATCH_SIZE \
- --max_seq_length $SEQ_LEN \
+ --max_length $SEQ_LEN \
$EXTRA_TRAINING_ARGS
"""

2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -68,6 +68,8 @@
    title: Online DPO
  - local: gkd_trainer
    title: GKD
+ - local: grpo_trainer
+   title: GRPO
  - local: kto_trainer
    title: KTO
  - local: nash_md_trainer
@@ -16,7 +16,7 @@ The `alignprop.py` script is a working example of using the `AlignProp` trainer

**Note:** one A100 GPU is recommended to get this running. For lower memory setting, consider setting truncated_backprop_rand to False. With default settings this will do truncated backpropagation with K=1.

- Almost every configuration parameter has a default. There is only one commandline flag argument that is required of the user to get things up and running. The user is expected to have a [huggingface user access token](https://huggingface.co/docs/hub/security-tokens) that will be used to upload the model post finetuning to HuggingFace hub. The following bash command is to be entered to get things running
+ Almost every configuration parameter has a default. There is only one commandline flag argument that is required of the user to get things up and running. The user is expected to have a [huggingface user access token](https://huggingface.co/docs/hub/security-tokens) that will be used to upload the model post-finetuning to HuggingFace hub. The following bash command is to be entered to get things running

```batch
python alignprop.py --hf_user_access_token <token>
@@ -26,7 +26,7 @@ To obtain the documentation of `stable_diffusion_tuning.py`, please run `python

The following are things to keep in mind (The code checks this for you as well) in general while configuring the trainer (beyond the use case of using the example script)

- - The configurable randomized truncation range (`--alignprop_config.truncated_rand_backprop_minmax=(0,50)`) the first number should be equal and greater to 0, while the second number should equal or less to the number of diffusion timesteps (sample_num_steps)
+ - The configurable randomized truncation range (`--alignprop_config.truncated_rand_backprop_minmax=(0,50)`) the first number should be equal and greater than 0, while the second number should equal or less to the number of diffusion timesteps (sample_num_steps)
- The configurable truncation backprop absolute step (`--alignprop_config.truncated_backprop_timestep=49`) the number should be less than the number of diffusion timesteps (sample_num_steps), it only matters when truncated_backprop_rand is set to False
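
The two constraints above are easy to encode as a quick self-check. A minimal sketch using a stand-in dataclass rather than the real AlignProp config class (the field names come from the flags quoted in this hunk; the default values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class AlignPropSettings:
    # Stand-in for the AlignProp configuration; not the TRL class itself.
    sample_num_steps: int = 50
    truncated_backprop_rand: bool = True
    truncated_rand_backprop_minmax: tuple = (0, 50)
    truncated_backprop_timestep: int = 49

def validate(cfg: AlignPropSettings) -> None:
    lo, hi = cfg.truncated_rand_backprop_minmax
    # Randomized truncation range: lower bound >= 0, upper bound <= sample_num_steps.
    assert 0 <= lo and hi <= cfg.sample_num_steps
    # Absolute truncation step must be below sample_num_steps; it only takes
    # effect when truncated_backprop_rand is set to False.
    assert cfg.truncated_backprop_timestep < cfg.sample_num_steps

validate(AlignPropSettings())
```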

## Setting up the image logging hook function
4 changes: 2 additions & 2 deletions docs/source/bco_trainer.mdx → docs/source/bco_trainer.md
@@ -62,7 +62,7 @@ embedding_model = Accelerator().prepare_model(self.embedding_model)
embedding_func = partial(embed_prompt, model=embedding_model)
```

- Set `prompt_sample_size` to defined how many prompts are selected to train the UDM classifier and start the training with the provided embedding function:
+ Set `prompt_sample_size` to define how many prompts are selected to train the UDM classifier and start the training with the provided embedding function:

```py
training_args = BCOConfig(
@@ -97,4 +97,4 @@ To scale how much the auxiliary loss contributes to the total loss, use the hype

## BCOConfig

- [[autodoc]] BCOConfig
+ [[autodoc]] BCOConfig
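
Since the `BCOConfig(` call in the hunk above is cut off by the diff view, here is a minimal sketch of how `prompt_sample_size` and the embedding function fit together, assuming `BCOTrainer` accepts `embedding_func` as in the quoted docs; `model`, `tokenizer`, `train_dataset`, and `embedding_func` are placeholders defined as in the surrounding documentation:

```python
# Sketch only: model, tokenizer, train_dataset and embedding_func are assumed to be
# defined earlier (embedding_func is the partial(embed_prompt, ...) from the hunk above).
from trl import BCOConfig, BCOTrainer

training_args = BCOConfig(
    output_dir="bco-model",
    prompt_sample_size=512,  # number of prompts used to train the UDM classifier
)
trainer = BCOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    embedding_func=embedding_func,
)
trainer.train()
```
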
2 changes: 1 addition & 1 deletion docs/source/best_of_n.mdx → docs/source/best_of_n.md
@@ -67,6 +67,6 @@ best_of_n.generate(query_tensors, device=device)

```

- Furthermore, at the time of initialization you can set the seed to control repeatability of the generation process and the number of samples to generate for each query
+ Furthermore, at the time of initialization you can set the seed to control the repeatability of the generation process and the number of samples to generate for each query
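
Put differently, best-of-n sampling generates several candidates per query and keeps the one a scorer ranks highest. A conceptual, self-contained sketch (plain Python with a dummy scorer, not the TRL sampler API):

```python
import random

def dummy_score(text: str) -> float:
    # Stand-in for a reward model: here it simply prefers longer candidates.
    return float(len(text))

def best_of_n(query: str, n_samples: int, seed: int) -> str:
    # The seed makes candidate generation repeatable, as described above.
    rng = random.Random(seed)
    candidates = [query + " " + "x" * rng.randint(1, 10) for _ in range(n_samples)]
    return max(candidates, key=dummy_score)

print(best_of_n("What is RLHF?", n_samples=4, seed=0))
```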


File renamed without changes.
3 changes: 2 additions & 1 deletion docs/source/clis.mdx → docs/source/clis.md
@@ -7,12 +7,13 @@ Currently supported CLIs are:
#### Training commands

- `trl dpo`: fine-tune a LLM with DPO
+ - `trl grpo`: fine-tune a LLM with GRPO
- `trl kto`: fine-tune a LLM with KTO
- `trl sft`: fine-tune a LLM with SFT

#### Other commands

- - `trl chat`: quickly spin up a LLM fine-tuned for chatting
+ - `trl chat`: quickly spin up an LLM fine-tuned for chatting
- `trl env`: get the system information

## Fine-tuning with the CLI