
Resuming from checkpoint doesn't seem to work #2657

Open
5 tasks done
Superskyyy opened this issue Jan 25, 2025 · 2 comments
Labels
🐛 bug Something isn't working ⏳ needs more info Additional information or clarification is required to proceed 🏋 PPO Related to PPO 🏋 RLOO Related to RLOO

Comments

@Superskyyy
Contributor

Superskyyy commented Jan 25, 2025

Reproduction

I've been using PPO/RLOOTrainer and it seems that resume_from_checkpoint doesn't work. Looking at the code, I was surprised to find that apparently nothing implements a checkpoint-loading mechanism, not even the one from Hugging Face transformers (the trainer.train method doesn't take a resume_from_checkpoint arg).

How can I load the checkpoint back and resume training? I assume people have been using this feature in the past, and I somehow missed the guide on how to do so. I'm currently sitting on the checkpoint but don't know how to use it :)

The TrainerConfig object does take a resume_from_checkpoint arg, but that does nothing except pass it through the HfArgumentParser. Unlike in the transformers library, the trainer.train method doesn't take any parameters.
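
For context, here's a quick check of the mismatch (a minimal sketch; the printed signatures are from my environment and may differ across transformers/TRL versions):

```python
# Compare train() signatures: transformers' Trainer accepts a
# resume_from_checkpoint argument, while TRL's PPOTrainer does not.
import inspect

from transformers import Trainer
from trl import PPOTrainer

print(inspect.signature(Trainer.train))
# (self, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None, **kwargs)

print(inspect.signature(PPOTrainer.train))
# (self)  <- no resume_from_checkpoint parameter
```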

Any help would be appreciated! Thanks.

System Info

  • Platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.36
  • Python version: 3.12.8
  • PyTorch version: 2.5.1
  • CUDA device(s): 8× NVIDIA H100 80GB HBM3
  • Transformers version: 4.48.0
  • Accelerate version: 0.34.2
  • Accelerate config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: DEEPSPEED
    • use_cpu: False
    • debug: False
    • num_processes: 8
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • enable_cpu_affinity: False
    • deepspeed_config: {'deepspeed_config_file': 'deepspeed_config.json', 'zero3_init_flag': False}
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
  • Datasets version: 3.2.0
  • HF Hub version: 0.27.1
  • TRL version: 0.14.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.15.4
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: not installed

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
@github-actions github-actions bot added 🏋 RLOO Related to RLOO 🏋 PPO Related to PPO 🐛 bug Something isn't working ⏳ needs more info Additional information or clarification is required to proceed labels Jan 25, 2025
@Superskyyy
Contributor Author

Superskyyy commented Jan 26, 2025

@qgallouedec Please give some insight into how to recover from the checkpoint and resume the optimizer/data steps, as this is blocking my training. If it needs some implementation, I will contribute it back to the repo. Thanks!

@Superskyyy
Contributor Author

I have a very rough implementation following the transformers Trainer design, and it seems to be working, though I'm not sure why none of the trainers in TRL seem to support resuming. A sketch of the general idea follows.
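
For reference, a minimal, self-contained sketch of the general pattern (not my actual implementation): it assumes Accelerate's real `save_state`/`load_state` APIs and the transformers `checkpoint-<step>` directory convention; the toy model, data, and `CKPT_DIR` path are purely illustrative.

```python
# Sketch: resume training with Accelerate, mirroring the transformers
# "checkpoint-<step>" convention. Toy model/data; paths are illustrative.
import os

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

CKPT_DIR = "output/checkpoint-4"  # hypothetical path written by save_state

completed_steps = 0
if os.path.isdir(CKPT_DIR):
    # Restores model weights, optimizer state, and RNG states that were
    # written earlier with accelerator.save_state(CKPT_DIR).
    accelerator.load_state(CKPT_DIR)
    completed_steps = int(CKPT_DIR.rsplit("-", 1)[-1])

# Fast-forward the dataloader past batches consumed before the interruption.
resumed_loader = accelerator.skip_first_batches(dataloader, completed_steps)

for step, (x, y) in enumerate(resumed_loader, start=completed_steps):
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    if (step + 1) % 4 == 0:
        accelerator.save_state(f"output/checkpoint-{step + 1}")
```

The two key pieces are restoring optimizer/RNG state with `load_state` and fast-forwarding the dataloader with `skip_first_batches`, so a resumed run sees the same data order as an uninterrupted one.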
