
Resuming from checkpoint doesn't seem to work #2657

Open
5 tasks done
Superskyyy opened this issue Jan 25, 2025 · 2 comments
Labels
🐛 bug Something isn't working ⏳ needs more info Additional information or clarification is required to proceed 🏋 PPO Related to PPO 🏋 RLOO Related to RLOO

Comments

@Superskyyy
Contributor

Superskyyy commented Jan 25, 2025

Reproduction

I've been using PPO/RLOOTrainer and it seems that resume_from_checkpoint doesn't work. Looking at the code, I was surprised to find that apparently nothing implements a checkpoint-loading mechanism, not even the one from Hugging Face transformers (the trainer.train method doesn't take a resume_from_checkpoint arg).

How can I load the checkpoint back and resume training? I assume people have been using this feature in the past, and I somehow missed the guide on how to do so. I'm currently sitting on the checkpoint but don't know how to use it :)

The TrainerConfig object does take a resume_from_checkpoint arg, but that does nothing except pass it through the HfArgumentParser. Unlike in the transformers library, the trainer.train method doesn't take any parameters.
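
For context, here's a quick check of the mismatch (a minimal sketch; the printed signatures are from my environment and may differ across transformers/TRL versions):

```python
# Compare train() signatures: transformers' Trainer accepts a
# resume_from_checkpoint argument, while TRL's PPOTrainer does not.
import inspect

from transformers import Trainer
from trl import PPOTrainer

print(inspect.signature(Trainer.train))
# (self, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None, **kwargs)

print(inspect.signature(PPOTrainer.train))
# (self)  <- no resume_from_checkpoint parameter
```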

Any help would be appreciated! Thanks.

System Info

  • Platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.36
  • Python version: 3.12.8
  • PyTorch version: 2.5.1
  • CUDA device(s): 8× NVIDIA H100 80GB HBM3
  • Transformers version: 4.48.0
  • Accelerate version: 0.34.2
  • Accelerate config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: DEEPSPEED
    • use_cpu: False
    • debug: False
    • num_processes: 8
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • enable_cpu_affinity: False
    • deepspeed_config: {'deepspeed_config_file': 'deepspeed_config.json', 'zero3_init_flag': False}
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
  • Datasets version: 3.2.0
  • HF Hub version: 0.27.1
  • TRL version: 0.14.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.15.4
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: not installed

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
@github-actions github-actions bot added 🏋 RLOO Related to RLOO 🏋 PPO Related to PPO 🐛 bug Something isn't working ⏳ needs more info Additional information or clarification is required to proceed labels Jan 25, 2025
@Superskyyy
Contributor Author

Superskyyy commented Jan 26, 2025

@qgallouedec Please give some insight into how to recover from the checkpoint and resume the optimizer/data steps, as this is blocking my training. If it needs some implementation, I will contribute it back to the repo. Thanks!

@Superskyyy
Contributor Author

I have a very rough implementation following the transformers Trainer design, and it seems to be working, though I'm not sure why none of the trainers in TRL seem to support resuming. A sketch of the general idea follows.
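
For reference, a minimal, self-contained sketch of the general pattern (not my actual implementation): it assumes Accelerate's real `save_state`/`load_state` APIs and the transformers `checkpoint-<step>` directory convention; the toy model, data, and `CKPT_DIR` path are purely illustrative.

```python
# Sketch: resume training with Accelerate, mirroring the transformers
# "checkpoint-<step>" convention. Toy model/data; paths are illustrative.
import os

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

CKPT_DIR = "output/checkpoint-4"  # hypothetical path written by save_state

completed_steps = 0
if os.path.isdir(CKPT_DIR):
    # Restores model weights, optimizer state, and RNG states that were
    # written earlier with accelerator.save_state(CKPT_DIR).
    accelerator.load_state(CKPT_DIR)
    completed_steps = int(CKPT_DIR.rsplit("-", 1)[-1])

# Fast-forward the dataloader past batches consumed before the interruption.
resumed_loader = accelerator.skip_first_batches(dataloader, completed_steps)

for step, (x, y) in enumerate(resumed_loader, start=completed_steps):
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    if (step + 1) % 4 == 0:
        accelerator.save_state(f"output/checkpoint-{step + 1}")
```

The two key pieces are restoring optimizer/RNG state with `load_state` and fast-forwarding the dataloader with `skip_first_batches`, so a resumed run sees the same data order as an uninterrupted one.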
