
OOM 8xH100 using latest GRPO code with vLLM #2688

Open
abacaj opened this issue Jan 30, 2025 · 6 comments
Labels
🐛 bug Something isn't working 🚀 deepspeed Related to deepspeed 🏋 GRPO Related to GRPO

Comments


abacaj commented Jan 30, 2025

Reproduction

Model is 8B.

Training works fine with DeepSpeed when vLLM is disabled. When I enable vLLM together with DeepSpeed, I get an OOM on the vLLM device while the model is loading:

INFO 01-30 05:50:12 model_runner.py:1115] Loading model weights took 0.0000 GB
INFO 01-30 05:50:16 worker.py:266] Memory profiling takes 3.26 seconds
INFO 01-30 05:50:16 worker.py:266] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 01-30 05:50:16 worker.py:266] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 71.19GiB.
INFO 01-30 05:50:16 executor_base.py:108] # CUDA blocks: 39874, # CPU blocks: 2240
INFO 01-30 05:50:16 executor_base.py:113] Maximum concurrency for 131072 tokens per request: 4.87x
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ab/grpo/train.py", line 173, in <module>
[rank0]:     trainer = GRPOTrainer(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 314, in __init__
[rank0]:     self.llm = LLM(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/utils.py", line 1039, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 239, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 482, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 274, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 427, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 119, in initialize_cache
[rank0]:     self.collective_rpc("initialize_cache",
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/utils.py", line 2208, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 308, in initialize_cache
[rank0]:     self._init_cache_engine()
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 313, in _init_cache_engine
[rank0]:     self.cache_engine = [
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 314, in <listcomp>
[rank0]:     CacheEngine(self.cache_config, self.model_config,
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 62, in __init__
[rank0]:     self.gpu_cache = self._allocate_kv_cache(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 81, in _allocate_kv_cache
[rank0]:     torch.zeros(kv_cache_shape,
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.74 GiB. GPU 6 has a total capacity of 79.10 GiB of which 961.88 MiB is free. Including non-PyTorch memory, this process has 78.15 GiB memory in use. Of the allocated memory 77.42 GiB is allocated by PyTorch, and 74.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

System Info

- Platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version: 2.5.1
- CUDA device(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
- Transformers version: 4.48.1
- Accelerate version: 1.3.0
- Accelerate config: 
  - compute_environment: LOCAL_MACHINE
  - distributed_type: DEEPSPEED
  - use_cpu: False
  - debug: False
  - num_processes: 7
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - deepspeed_config: {'deepspeed_config_file': 'configs/deepspeed.json', 'zero3_init_flag': False}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- Datasets version: 3.2.0
- HF Hub version: 0.28.0
- TRL version: 0.14.0.dev0
- bitsandbytes version: not installed
- DeepSpeed version: 0.16.3
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.60.2
- PEFT version: 0.9.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
github-actions bot added the 🏋 GRPO Related to GRPO, 🚀 deepspeed Related to deepspeed, and 🐛 bug Something isn't working labels Jan 30, 2025

kpfoley commented Jan 30, 2025

I'm getting the same error on a 4xL4 (4 x 24 GB CUDA RAM) instance with Llama3.2-1b. vLLM loads this model on one GPU device (24 GB) with no problems, but it fails with the same OOM error as above when called with accelerate.

lewtun (Member) commented Jan 30, 2025

Hey @abacaj! Can you please share a gist or command that reproduces the error? That would make it easier for us to debug.

qgallouedec (Member) commented:

Try reducing vllm_gpu_memory_utilization (default 0.9):

from trl import GRPOConfig

GRPOConfig(..., vllm_gpu_memory_utilization=0.7)
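
If lowering the utilization is not enough: with num_processes: 7 on an 8-GPU node, one GPU stays free of training state, and pinning vLLM to it avoids competing with the DeepSpeed processes for memory. A minimal sketch, assuming this TRL version also exposes use_vllm and vllm_device on GRPOConfig (only vllm_gpu_memory_utilization is confirmed above):

from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,                    # assumed flag that enables the vLLM generation backend
    vllm_device="cuda:7",             # assumed option: pin vLLM to the GPU the 7 training processes don't use
    vllm_gpu_memory_utilization=0.7,  # below the 0.9 default to leave headroom for other allocations
)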

imrankh46 commented:

@qgallouedec
Why are you using a reward model here?

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Load the dataset
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    learning_rate=1e-5,
    logging_steps=10,
    gradient_accumulation_steps=16,
    max_completion_length=128,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_model="weqweasdas/RM-Gemma-2B",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),
)

trainer.train()

imrankh46 commented:

I mean, can we use any reward model, or do we need to fine-tune one specifically for a particular domain?

qgallouedec (Member) commented:

"Why are you using a reward model here?"

It's probably old code from the time when GRPO was not compatible with reward functions. Is this code still accessible somewhere?
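
For reference, recent TRL versions let GRPOTrainer take reward_funcs, which accepts either a reward model ID or a plain Python function, so no separate reward model is required. A minimal sketch of the function form, assuming the reward_funcs interface documented for recent TRL versions (the reward itself is illustrative only):

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: prefer shorter completions. The function receives the generated
# completions (extra dataset columns arrive via **kwargs) and returns one float each.
def brevity_reward(completions, **kwargs):
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=brevity_reward,  # a Python function instead of a separate reward model
    args=GRPOConfig(output_dir="Qwen2-0.5B-GRPO"),
    train_dataset=dataset,
)
trainer.train()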
