Add GPQA Diamond and fix evaluation deps #196

Merged on Feb 6, 2025 (18 commits)
106 changes: 83 additions & 23 deletions README.md
@@ -56,17 +56,17 @@ uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --u
Next, install vLLM:

```shell
uv pip install vllm>=0.7.0
uv pip install vllm==0.7.1

# For CUDA 12.1
pip install vllm>=0.7.0 --extra-index-url https://download.pytorch.org/whl/cu121
uv pip install vllm==0.7.1 --extra-index-url https://download.pytorch.org/whl/cu121 --index-strategy unsafe-best-match
export LD_LIBRARY_PATH=$(python -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')"):$LD_LIBRARY_PATH
```

This will also install PyTorch `v2.5.1`, and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:

```shell
pip install -e ".[dev]"
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]" --link-mode=copy
```

> **Member Author:** Needed because `uv` cannot install `lighteval` otherwise due to an LFS file conflict.
>
> **Collaborator:** Ah, I had this issue and had reverted back to pip; glad you fixed it.
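To double-check that the pinned versions actually landed, here is a quick sanity check (not part of the original instructions; it only relies on `torch` and `vllm` exposing `__version__` and on `uv pip show`):

```shell
# Should print the exact pins from setup.py: 2.5.1 and 0.7.1
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"
# lighteval should resolve to the Git commit pinned in setup.py
uv pip show lighteval
```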

Next, log into your Hugging Face and Weights and Biases accounts as follows:
@@ -142,16 +142,33 @@ We use `lighteval` to evaluate models, with custom tasks defined in `src/open_r1
```shell
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
TASK=aime24
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR

# MATH-500
TASK=math_500
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```

> **Member Author** (on the `--system-prompt` flag): Not needed for the DeepSeek models (gives ~1 point gain if included).

> [!IMPORTANT]
> You must set `max_model_length=32768` in the `vllm` command to align with the `generation_size` we define per eval. Without this, `lighteval` will throw an error.

To increase throughput across multiple GPUs, use _data parallel_ as follows:

```shell
@@ -164,7 +181,6 @@ OUTPUT_DIR=data/evals/$MODEL
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```
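The diff collapses the `MODEL_ARGS` definition for this block. A minimal sketch of the full data-parallel invocation, assuming the `data_parallel_size` argument used in `slurm/evaluate.slurm` later in this PR:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
# One model replica per GPU; the eval prompts are split across replicas
MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```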

@@ -181,50 +197,94 @@ export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR
```
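Likewise, the collapsed lines above define `MODEL_ARGS` for tensor parallelism. A minimal sketch, assuming the `tensor_parallel_size` setting used in the reproduction commands below; the `lighteval` call itself is unchanged:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
# A single model copy sharded across all GPUs; vLLM's workers need the spawn start method here
MODEL_ARGS="pretrained=$MODEL,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```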

You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.

To evaluate on a single GPU:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
```

To use Data Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
```

To use Tensor Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
```
## Reproducing DeepSeek's evaluation results on MATH-500

We are able to reproduce DeepSeek's reported results on the MATH-500 benchmark:
| Model | MATH-500 (HF lighteval) | MATH-500 (DeepSeek Reported) |
| :-------------------------- | :-------: | :----------------------------: |
| DeepSeek-R1-Distill-Qwen-1.5B | 81.6 | 83.9 |
| DeepSeek-R1-Distill-Qwen-7B | 91.8 | 92.8 |
| DeepSeek-R1-Distill-Qwen-14B | 94.2 | 93.9 |
| DeepSeek-R1-Distill-Qwen-32B | 95.0 | 94.3 |
| DeepSeek-R1-Distill-Llama-8B | 85.8 | 89.1 |
| DeepSeek-R1-Distill-Llama-70B | 93.4 | 94.5 |

## Reproducing DeepSeek's evaluation results

### MATH-500

We are able to reproduce DeepSeek's reported results on the MATH-500 benchmark to within ~1 standard deviation:

| Model | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
|:------------------------------|:-----------------------:|:----------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 81.8 | 83.9 |
| DeepSeek-R1-Distill-Qwen-7B | 91.8 | 92.8 |
| DeepSeek-R1-Distill-Qwen-14B | 94.2 | 93.9 |
| DeepSeek-R1-Distill-Qwen-32B | 95.0 | 94.3 |
| DeepSeek-R1-Distill-Llama-8B | 85.8 | 89.1 |
| DeepSeek-R1-Distill-Llama-70B | 93.4 | 94.5 |

To reproduce these results, use the following command:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|math_500|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
```

Alternatively, you can launch Slurm jobs as follows:

```shell
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-7B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-14B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-32B math_500 tp
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-8B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-70B math_500 tp
python scripts/run_benchmarks.py --model-id={model_id} --benchmarks math_500
```
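For example, to benchmark the 1.5B distill on MATH-500 (substituting a concrete model id from the table above for the `{model_id}` placeholder):

```shell
python scripts/run_benchmarks.py --model-id=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --benchmarks math_500
```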

### GPQA Diamond

We are able to reproduce DeepSeek's reported results on the GPQA Diamond benchmark to within ~1 standard deviation:

| Model | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
|:------------------------------|:---------------------------:|:--------------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 33.33 | 33.8 |
| DeepSeek-R1-Distill-Qwen-7B | 48.48 | 49.1 |
| DeepSeek-R1-Distill-Qwen-14B | 55.56 | 59.1 |
| DeepSeek-R1-Distill-Qwen-32B | 58.59 | 62.1 |
| DeepSeek-R1-Distill-Llama-8B | 51.01 | 49.0 |
| DeepSeek-R1-Distill-Llama-70B | 65.15 | 65.2 |

To reproduce these results, use the following command:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
```

Alternatively, you can launch Slurm jobs as follows:

```shell
python scripts/run_benchmarks.py --model-id={model_id} --benchmarks gpqa
```

## Data generation

Empty file added logs/.gitkeep
Empty file.
8 changes: 4 additions & 4 deletions setup.py
@@ -53,17 +53,17 @@
"huggingface-hub[cli]>=0.19.2,<1.0",
"isort>=5.12.0",
"liger_kernel==0.5.2",
"lighteval @ git+https://github.com/huggingface/lighteval.git@0e462692436e1f0575bdb4c6ef63453ad9bde7d4#egg=lighteval[math]",
"math-verify>=0.3.3", # Used for math verification in grpo
"lighteval @ git+https://github.com/huggingface/lighteval.git@3c9b0c9dde6718b23ef5b0f4960355f0d494bdfc#egg=lighteval[math]",
> **Member Author:** Bump to the latest commit once the vLLM fix for DDP is merged: huggingface/lighteval#541
>
> **Collaborator:** Done, it's 86f62259f105ae164f655e0b91c92a823a742724.
"math-verify==0.5.2", # Used for math verification in grpo
"packaging>=23.0",
"parameterized>=0.9.0",
"pytest",
"safetensors>=0.3.3",
"sentencepiece>=0.1.99",
"torch>=2.5.1",
"torch==2.5.1",
"transformers @ git+https://github.com/huggingface/transformers.git@main",
"trl @ git+https://github.com/huggingface/trl.git@main",
"vllm>=0.7.1",
"vllm==0.7.1",
"wandb>=0.19.1",
]

75 changes: 0 additions & 75 deletions slurm/eval_callback.slurm

This file was deleted.

98 changes: 59 additions & 39 deletions slurm/evaluate.slurm
@@ -1,55 +1,75 @@
#!/bin/bash
#SBATCH --job-name=open-r1-evaluate
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --partition=hopper-prod
#SBATCH --time=01:59:00
#SBATCH --output=./logs/evaluate/%x-%j.out
#SBATCH --err=./logs/evaluate/%x-%j.err

# Usage: sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B aime24
#SBATCH --partition=hopper-prod
#SBATCH --output=./logs/%x-%j.out
#SBATCH --err=./logs/%x-%j.err
#SBATCH --requeue

set -x -e

source ~/.bashrc
source openr1/bin/activate
module load cuda/12.1
echo "START TIME: $(date)"
echo "PYTHON ENV: $(which python)"

TASK_NAME=$1
TASKS=$2
MODEL_ID=$3
MODEL_REVISION=$4
# Optional args
[ -z "$5" ] && TENSOR_PARALLEL=False || TENSOR_PARALLEL=$5
[ -z "$6" ] && TRUST_REMOTE_CODE=False || TRUST_REMOTE_CODE=$6
# $7 is reserved for system_prompt, see line 51
NUM_GPUS=$(nvidia-smi -L | wc -l)
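
# Illustrative invocation of the new interface (not from the original script), assuming the
# positional arguments parsed above: task name, lighteval task spec, model id, revision,
# then the optional tensor-parallel / trust-remote-code flags and system prompt:
#   sbatch slurm/evaluate.slurm gpqa "custom|gpqa:diamond|0|0" deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B main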

NUM_GPUS=8
MODEL=$1
TASK=$2
# Check if a third argument is passed, if it is tp then eval with tensor parallelism. Required for larger models
if [ -n "$3" ] && [ "$3" == "tp" ]; then
MODEL_ARGS="pretrained=$MODEL,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
# Set whether to use tensor parallelism or data parallelism
if [ "$TENSOR_PARALLEL" = "True" ]; then
# use TP to shard model across NUM_GPUS
export VLLM_WORKER_MULTIPROC_METHOD=spawn
MODEL_ARGS="pretrained=$MODEL_ID,revision=$MODEL_REVISION,trust_remote_code=$TRUST_REMOTE_CODE,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
else
MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
MODEL_ARGS="pretrained=$MODEL_ID,revision=$MODEL_REVISION,trust_remote_code=$TRUST_REMOTE_CODE,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
fi
OUTPUT_DIR=data/evals/$MODEL


# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1

# Specific configuration optimized for the Hugging Face Compute Cluster
# Be ye warned this may not work on other clusters!
module load cuda/12.1
LM_EVAL_REPO_ID="open-r1/open-r1-eval-leaderboard"
MODEL_NAME=$(echo $MODEL_ID | sed 's/\//_/g') # replaces / with _
DETAILS_REPO_ID="open-r1/details-$MODEL_NAME"
OUTPUT_DIR="eval_results/$MODEL_ID/$MODEL_REVISION/$TASK_NAME"
# We need this flag since we run this script from training jobs that use DeepSpeed, and the env vars get propagated, which causes errors during evaluation
ACCELERATE_USE_DEEPSPEED=false
# Enable fast downloads
HF_HUB_ENABLE_HF_TRANSFER=1

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
--custom-tasks src/open_r1/evaluate.py \
echo "Running lighteval script ..."
echo "Eval results will be saved to $OUTPUT_DIR"
# Check if "custom" is a substring of TASKS
if [[ $TASKS == *"custom"* ]]; then
echo "Custom task detected. Running custom task evaluation script ..."
lighteval vllm $MODEL_ARGS $TASKS \
--custom-tasks "src/open_r1/evaluate.py" \
--use-chat-template \
--system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
--output-dir $OUTPUT_DIR \
--save-details \
--output-dir $OUTPUT_DIR
${7:+--system-prompt "$7"}
else
lighteval vllm $MODEL_ARGS $TASKS \
--use-chat-template \
--output-dir $OUTPUT_DIR \
--save-details \
${7:+--system-prompt "$7"}
fi

OUTPUT_FILEPATHS=$(find $OUTPUT_DIR/results/ -type f \( -name "*.json" \))
for filepath in $OUTPUT_FILEPATHS; do
echo "Uploading $filepath to Hugging Face Hub..."
filename=$(basename -- "$filepath")
huggingface-cli upload --repo-type space --private $LM_EVAL_REPO_ID $filepath $OUTPUT_DIR/$filename
done

echo "Uploading details to Hugging Face Hub..."
DETAILS_FILEPATHS=$(find $OUTPUT_DIR/details/ -type f \( -name "*.parquet" \))
echo "DETAILS_FILEPATHS: $DETAILS_FILEPATHS"
TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S")
python src/open_r1/utils/upload_details.py --data_files $DETAILS_FILEPATHS --hub_repo_id $DETAILS_REPO_ID --config_name $MODEL_REVISION.$TASK_NAME.$TIMESTAMP

echo "Cleaning up ..."
rm -rf $OUTPUT_DIR

echo "END TIME: $(date)"
echo "Done!"