Add GPQA Diamond and fix evaluation deps #196
Changes from 11 commits
@@ -56,17 +56,17 @@ uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --u
Next, install vLLM:

```shell
uv pip install vllm>=0.7.0
uv pip install vllm==0.7.1

# For CUDA 12.1
pip install vllm>=0.7.0 --extra-index-url https://download.pytorch.org/whl/cu121
uv pip install vllm==0.7.1 --extra-index-url https://download.pytorch.org/whl/cu121 --index-strategy unsafe-best-match
export LD_LIBRARY_PATH=$(python -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')"):$LD_LIBRARY_PATH
```

This will also install PyTorch `v2.5.1` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:

```shell
pip install -e ".[dev]"
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]" --link-mode=copy
```
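As a quick sanity check (an editorial aside, not part of the diff), you can confirm that the PyTorch build the vLLM wheels were compiled against is the one that actually got installed:

```shell
# vLLM 0.7.1 wheels are built against torch 2.5.1, so this should print 2.5.1
python -c "import torch; print(torch.__version__)"
```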
Next, log into your Hugging Face and Weights and Biases accounts as follows:
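The login commands themselves sit just outside this hunk; a minimal sketch of that step, assuming the standard `huggingface-cli` and `wandb` clients:

```shell
huggingface-cli login
wandb login
```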
@@ -134,16 +134,33 @@ We use `lighteval` to evaluate models, with custom tasks defined in `src/open_r1
```shell
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
TASK=aime24
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR

# MATH-500
TASK=math_500
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```

Review comment on the `--system-prompt` line: Not needed for the DeepSeek models (gives ~1 point gain if included).

> [!IMPORTANT]
> You must set `max_model_length=32768` in the `vllm` command to align with the `generation_size` we define per eval. Without this, `lighteval` will throw an error.

To increase throughput across multiple GPUs, use _data parallel_ as follows:

```shell
@@ -156,7 +173,6 @@ OUTPUT_DIR=data/evals/$MODEL
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```
@@ -173,50 +189,94 @@ export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```

You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.

To evaluate on a single GPU:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
```

To use Data Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
```

To use Tensor Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
```

## Reproducing Deepseek's evaluation results on MATH-500
We are able to reproduce Deepseek's reported results on the MATH-500 Benchmark:

| Model                         | MATH-500 (HF lighteval) | MATH-500 (DeepSeek Reported) |
|:------------------------------|:-----------------------:|:----------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 81.6                    | 83.9                         |
| DeepSeek-R1-Distill-Qwen-7B   | 91.8                    | 92.8                         |
| DeepSeek-R1-Distill-Qwen-14B  | 94.2                    | 93.9                         |
| DeepSeek-R1-Distill-Qwen-32B  | 95.0                    | 94.3                         |
| DeepSeek-R1-Distill-Llama-8B  | 85.8                    | 89.1                         |
| DeepSeek-R1-Distill-Llama-70B | 93.4                    | 94.5                         |

## Reproducing Deepseek's evaluation results

### MATH-500

We are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1 standard deviation:

| Model                         | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
|:------------------------------|:-----------------------:|:----------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 81.8                    | 83.9                         |
| DeepSeek-R1-Distill-Qwen-7B   | 91.8                    | 92.8                         |
| DeepSeek-R1-Distill-Qwen-14B  | 94.2                    | 93.9                         |
| DeepSeek-R1-Distill-Qwen-32B  | 95.0                    | 94.3                         |
| DeepSeek-R1-Distill-Llama-8B  | 85.8                    | 89.1                         |
| DeepSeek-R1-Distill-Llama-70B | 93.4                    | 94.5                         |

To reproduce these results use the following command:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|math_500|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
```

Alternatively, you can launch Slurm jobs as follows:

```shell
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-7B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-14B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-32B math_500 tp
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-8B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-70B math_500 tp
python scripts/run_benchmarks.py --model-id={model_id} --benchmarks math_500
```
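For a concrete invocation, the `{model_id}` placeholder can be filled with any checkpoint from the table above; for example (illustrative, not part of the diff):

```shell
python scripts/run_benchmarks.py --model-id=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --benchmarks math_500
```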
### GPQA Diamond

We are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1 standard deviation:

| Model                         | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
|:------------------------------|:---------------------------:|:--------------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 33.33                       | 33.8                             |
| DeepSeek-R1-Distill-Qwen-7B   | 48.48                       | 49.1                             |
| DeepSeek-R1-Distill-Qwen-14B  | 55.56                       | 59.1                             |
| DeepSeek-R1-Distill-Qwen-32B  | 58.59                       | 62.1                             |
| DeepSeek-R1-Distill-Llama-8B  | 51.01                       | 49.0                             |
| DeepSeek-R1-Distill-Llama-70B | x                           | 65.2                             |

To reproduce these results use the following command:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
```

```shell
python scripts/run_benchmarks.py --model-id={model_id} --benchmarks gpqa
```

## Data generation

@@ -53,17 +53,17 @@
"huggingface-hub[cli]>=0.19.2,<1.0",
"isort>=5.12.0",
"liger_kernel==0.5.2",
"lighteval @ git+https://github.com/huggingface/lighteval.git@0e462692436e1f0575bdb4c6ef63453ad9bde7d4#egg=lighteval[math]",
"math-verify>=0.3.3", # Used for math verification in grpo
"lighteval @ git+https://github.com/huggingface/lighteval.git@3c9b0c9dde6718b23ef5b0f4960355f0d494bdfc#egg=lighteval[math]",
"math-verify==0.5.2", # Used for math verification in grpo
"packaging>=23.0",
"parameterized>=0.9.0",
"pytest",
"safetensors>=0.3.3",
"sentencepiece>=0.1.99",
"torch>=2.5.1",
"torch==2.5.1",
"transformers @ git+https://github.com/huggingface/transformers.git@main",
"trl @ git+https://github.com/huggingface/trl.git@main",
"vllm>=0.7.1",
"vllm==0.7.1",
"wandb>=0.19.1",
]

Review comment on the new `lighteval` pin: Bump to latest commit once vllm fix for DDP is merged: huggingface/lighteval#541
Reply: done, it's 86f62259f105ae164f655e0b91c92a823a742724
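To double-check that the tightened pins are what actually ends up in the environment, a quick check after reinstalling (editorial sketch, not part of the PR):

```shell
# Re-resolve the editable install, then inspect the packages whose pins changed
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]" --link-mode=copy
uv pip list | grep -E "torch|vllm|lighteval|math-verify"
```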
This file was deleted.
@@ -1,55 +1,78 @@
#!/bin/bash
#SBATCH --job-name=open-r1-evaluate
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --partition=hopper-prod
#SBATCH --time=01:59:00
#SBATCH --output=./logs/evaluate/%x-%j.out
#SBATCH --err=./logs/evaluate/%x-%j.err

# Usage: sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B aime24
#SBATCH --partition=hopper-prod
#SBATCH --output=./logs/%x-%j.out
#SBATCH --err=./logs/%x-%j.err
#SBATCH --requeue

set -x -e

source ~/.bashrc
source openr1/bin/activate
module load cuda/12.1
echo "START TIME: $(date)"
echo "PYTHON ENV: $(which python)"

TASK_NAME=$1
TASKS=$2
MODEL_ID=$3
MODEL_REVISION=$4
# Optional args
[ -z "$5"] && TENSOR_PARALLEL=False || TENSOR_PARALLEL=$5
[ -z "$6"] && TRUST_REMOTE_CODE=False || TRUST_REMOTE_CODE=$6
# $7 is reserved for system_prompt, see line 51
NUM_GPUS=$(nvidia-smi -L | wc -l)

NUM_GPUS=8
MODEL=$1
TASK=$2
# Check if a third argument is passed, if it is tp then eval with tensor parallelism. Required for larger models
if [ -n "$3" ] && [ "$3" == "tp" ]; then
MODEL_ARGS="pretrained=$MODEL,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
# Set Whether to use tensor parallelism or data parallelism
if [ "$TENSOR_PARALLEL" = "True" ]; then
# use TP to shard model across NUM_GPUS
export VLLM_WORKER_MULTIPROC_METHOD=spawn
MODEL_ARGS="pretrained=$MODEL_ID,revision=$MODEL_REVISION,trust_remote_code=$TRUST_REMOTE_CODE,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
else
MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
fi
OUTPUT_DIR=data/evals/$MODEL

# TODO: restore data parallelism once lighteval fixed:
# MODEL_ARGS="pretrained=$MODEL_ID,revision=$MODEL_REVISION,trust_remote_code=$TRUST_REMOTE_CODE,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
MODEL_ARGS="pretrained=$MODEL_ID,revision=$MODEL_REVISION,trust_remote_code=$TRUST_REMOTE_CODE,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1
fi

# Specific configuration optimized for the Hugging Face Compute Cluster
# Be ye warned this may not work on other clusters!
module load cuda/12.1
LM_EVAL_REPO_ID="open-r1/open-r1-eval-leaderboard"
MODEL_NAME=$(echo $MODEL_ID | sed 's/\//_/g') # replaces / with _
DETAILS_REPO_ID="open-r1/details-$MODEL_NAME"
OUTPUT_DIR="eval_results/$MODEL_ID/$MODEL_REVISION/$TASK_NAME"
# We need this flag since we run this script from training jobs that use DeepSpeed and the env vars get progated which causes errors during evaluation
ACCELERATE_USE_DEEPSPEED=false
# Enable fast downloads
HF_HUB_ENABLE_HF_TRANSFER=1

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
echo "Running lighteval script ..."
echo "Eval results will be saved to $OUTPUT_DIR"
# Check if "custom" is a substring of TASKS
if [[ $TASKS == *"custom"* ]]; then
echo "Custom task detected. Running custom task evaluation script ..."
lighteval vllm $MODEL_ARGS $TASKS \
--custom-tasks "src/open_r1/evaluate.py" \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR \
--save-details \
--output-dir $OUTPUT_DIR
${7:+--system-prompt "$7"}
else
lighteval vllm $MODEL_ARGS $TASKS \
--use-chat-template \
--output-dir $OUTPUT_DIR \
--save-details \
${7:+--system-prompt "$7"}
fi

OUTPUT_FILEPATHS=$(find $OUTPUT_DIR/results/ -type f \( -name "*.json" \))
for filepath in $OUTPUT_FILEPATHS; do
echo "Uploading $filepath to Hugging Face Hub..."
filename=$(basename -- "$filepath")
huggingface-cli upload --repo-type space --private $LM_EVAL_REPO_ID $filepath $OUTPUT_DIR/$filename
done

echo "Uploading details to Hugging Face Hub..."
DETAILS_FILEPATHS=$(find $OUTPUT_DIR/details/ -type f \( -name "*.parquet" \))
echo "DETAILS_FILEPATHS: $DETAILS_FILEPATHS"
TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S")
python src/open_r1/utils/upload_details.py --data_files $DETAILS_FILEPATHS --hub_repo_id $DETAILS_REPO_ID --config_name $MODEL_REVISION.$TASK_NAME.$TIMESTAMP

echo "Cleaning up ..."
rm -rf $OUTPUT_DIR

echo "END TIME: $(date)"
echo "Done!"
Review comment: Needed because `uv` cannot install `lighteval` otherwise due to some LFS file conflict.
Reply: Ah, I had this issue, I had reverted back to pip, glad you fixed it.