From 65c44d89700029049a03602b244939cc8b5dd7b8 Mon Sep 17 00:00:00 2001
From: Lewis Tunstall
Date: Wed, 19 Feb 2025 09:50:51 +0000
Subject: [PATCH] doc

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 40fd8c96..5f82d72e 100644
--- a/README.md
+++ b/README.md
@@ -170,12 +170,12 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
 
 Our final [model](https://huggingface.co/Dongwei/Qwen-2.5-7B_Base_Math_smalllr), while using different learning rates, loss functions and reward structures, achieves 69.4% accuracy on MATH-500, demonstrating a 17%+ improvement over the base model.
 
-#### Training with a code interpreter
+#### 👨‍💻 Training with a code interpreter
 
 We provide a `code` reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like [Codeforces](https://codeforces.com), where solutions are executed against a set of test cases and the overall success rate is returned as the final reward. To ensure safe execution, we use [E2B](https://e2b.dev) sandboxes, which are fast and cheap to run. To use this reward function, first install the necessary dependencies:
 
 ```shell
-uv pip install -e '.[code]
+uv pip install -e '.[code]'
 ```
 
 Then create a `.env` file and place an API token from E2B within it:
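
The patch's final context line refers to the E2B setup step. A minimal sketch of what that `.env` file typically contains, assuming E2B's standard `E2B_API_KEY` variable name and a placeholder token value:

```shell
# .env, read at startup to authenticate with the E2B sandbox service
# "e2b_***" is a hypothetical placeholder; the real token comes from the E2B dashboard
E2B_API_KEY="e2b_***"
```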