Resuming from checkpoint doesn't seem to work #2657
Labels
🐛 bug: Something isn't working
⏳ needs more info: Additional information or clarification is required to proceed
🏋 PPO: Related to PPO
🏋 RLOO: Related to RLOO
Reproduction
I've been using PPOTrainer/RLOOTrainer and it seems that resume_from_checkpoint doesn't work. Looking at the code, I was surprised to find that apparently nothing implements a checkpoint-loading mechanism, not even by reusing the one from Hugging Face transformers (the trainer.train method doesn't take a resume_from_checkpoint argument).
How can I load the checkpoint back and resume training? I assume people have been using this feature in the past and that I somehow missed the guide on how to do so. I'm currently sitting on a checkpoint but don't know how to use it :)
The TrainerConfig object does take a resume_from_checkpoint argument, but that does nothing except pass it through HfArgumentParser. The trainer.train method takes no parameters, unlike the one in the transformers library.
Any help would be appreciated! Thanks.
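A possible stopgap, sketched under the assumption that the checkpoint directory (the `output/checkpoint-500` path below is hypothetical) contains weights in the standard transformers format saved by the trainer: reload the policy from it and rebuild the trainer around the restored model. This is a warm restart rather than a true resume, since only the model weights are recovered, not the optimizer, LR-scheduler, or step-counter state.

```python
# Minimal warm-restart sketch; NOT an official TRL resume mechanism.
# Assumption: "output/checkpoint-500" is a hypothetical checkpoint directory
# written by the trainer, holding weights and tokenizer files in the
# standard transformers format.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "output/checkpoint-500"  # hypothetical path

# Reload the policy weights (and the tokenizer, if it was saved alongside them).
policy = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

# Rebuild the PPOTrainer/RLOOTrainer around the restored policy exactly as in
# the original training script, then call trainer.train() again. Optimizer,
# scheduler, and RNG state from the interrupted run are lost with this approach.
```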