- Google account with access to Google Colab
- Your training data (`jake_training.json`)
- Google Drive with at least 20GB free space
### Create Google Drive Structure

```
jake-ai/
├── data/
│   └── training/
│       └── jake_training.json
├── checkpoints/
└── output/
```
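You can create these folders by hand in the Drive web UI. Alternatively, once Drive is mounted (covered in "Mount Drive & Setup" below), a cell like this sketch builds the same layout; the `MyDrive/jake-ai` base path matches the paths used in the training cells.

```python
import os

# Assumes Drive is already mounted at /content/drive (see "Mount Drive & Setup")
base = "/content/drive/MyDrive/jake-ai"
for sub in ("data/training", "checkpoints", "output"):
    os.makedirs(os.path.join(base, sub), exist_ok=True)
```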
### Colab Setup
- Open Google Colab
- Create a new notebook
- Connect to a GPU runtime (Runtime > Change runtime type > GPU); a quick verification cell is shown after this list
- Mount Google Drive
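Before installing anything, it is worth confirming that a GPU was actually allocated. A minimal check from a fresh cell (PyTorch comes preinstalled on Colab):

```python
# Confirm a GPU runtime is active
import torch
print(torch.cuda.is_available())  # should print True

# Show the allocated GPU (typically a T4 on the free tier)
!nvidia-smi
```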
### Install Required Packages

```python
!pip install -q torch==2.1.0 transformers==4.35.0 accelerate==0.24.0
!pip install -q bitsandbytes==0.41.1 scipy
!pip install -q git+https://github.com/huggingface/peft.git
```
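A quick sanity check that the pinned versions were picked up (restart the runtime first if pip warns about already-imported packages):

```python
# Print the installed versions to confirm the pins took effect
import torch, transformers, peft, bitsandbytes
print(torch.__version__, transformers.__version__, peft.__version__, bitsandbytes.__version__)
```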
### Mount Drive & Setup

```python
from google.colab import drive
drive.mount('/content/drive')

# Copy training script
!mkdir -p /content/jake-ai
!cp "/content/drive/MyDrive/jake-ai/jake_finetune_yi.py" /content/jake-ai/
```
### Import and Run Training

```python
import sys
sys.path.append('/content/jake-ai')

from jake_finetune_yi import *

# Setup model
model, tokenizer = setup_model_and_tokenizer()

# Load dataset
dataset = prepare_dataset("/content/drive/MyDrive/jake-ai/data/training/jake_training.json")

# Setup LoRA
lora_config = setup_lora_config()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Setup training
training_args = setup_training_arguments("/content/jake-ai-checkpoints")
trainer = create_jake_trainer(model, tokenizer, dataset, training_args)

# Start training
trainer.train()

# Save final model
trainer.save_model("/content/drive/MyDrive/jake-ai/output/jake-yi-final")
```
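The helper functions (`setup_model_and_tokenizer`, `setup_lora_config`, and so on) live in `jake_finetune_yi.py`, which is not reproduced here. As a rough idea of the kind of object `setup_lora_config()` returns, a typical PEFT LoRA config for a causal LM looks like the sketch below; the actual rank, dropout, and target modules in the script may differ.

```python
from peft import LoraConfig

# Illustrative values only -- the real settings come from jake_finetune_yi.py
lora_config = LoraConfig(
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly targeted
    bias="none",
    task_type="CAUSAL_LM",
)
```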
### Test Generation

```python
# Generate a sample story with the trained model
story = generate_jake_story(model, tokenizer)
print(story)
```
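To pick the model back up in a later session (after re-mounting Drive and re-running the install and import cells), the adapter saved by `trainer.save_model()` can usually be re-attached to a fresh base model with PEFT. A minimal sketch, assuming the save produced a standard PEFT adapter directory, which is the usual behaviour when the trainer wraps a PEFT model:

```python
from peft import PeftModel
from jake_finetune_yi import setup_model_and_tokenizer, generate_jake_story

# Rebuild the quantized base model, then attach the saved LoRA adapter from Drive
base_model, tokenizer = setup_model_and_tokenizer()
model = PeftModel.from_pretrained(
    base_model,
    "/content/drive/MyDrive/jake-ai/output/jake-yi-final",
)
model.eval()

story = generate_jake_story(model, tokenizer)
print(story)
```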
### Memory Management
- The script is optimized for Colab's T4 GPU
- Uses 4-bit quantization and gradient checkpointing
- Small batch size with gradient accumulation (see the sketch after this list)
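For reference, the kind of configuration these bullets describe looks roughly like the sketch below; the concrete values (and the base model ID) live in `jake_finetune_yi.py` and may differ.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative memory-saving setup; the actual values are defined in jake_finetune_yi.py
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-6B",                     # placeholder -- use the base model from the script
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade extra compute for lower memory
```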
### Checkpointing
- Models are saved every 50 steps
- Only the last 3 checkpoints are kept to save space (see the sketch after this list)
- Final model saved to Drive
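In `TrainingArguments` terms, that behaviour corresponds roughly to settings like these; the real values are set by `setup_training_arguments()` in the script.

```python
from transformers import TrainingArguments

# Illustrative checkpointing settings matching the bullets above
training_args = TrainingArguments(
    output_dir="/content/jake-ai-checkpoints",
    save_steps=50,        # write a checkpoint every 50 steps
    save_total_limit=3,   # keep only the 3 most recent checkpoints
)
```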
### Disconnection Handling
- Regular checkpoints allow training to resume (see the sketch after this list)
- Keep Colab tab active to prevent disconnections
- Use "Connect to hosted runtime" if disconnected
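If the runtime does disconnect mid-run, training can usually be resumed from the newest checkpoint rather than starting over. A minimal sketch, after re-running the setup cells above:

```python
# Resume from the most recent checkpoint found in training_args.output_dir
trainer.train(resume_from_checkpoint=True)
```

Note that `resume_from_checkpoint=True` looks in `training_args.output_dir`; if checkpoints are written to the ephemeral `/content` disk, consider pointing the output directory at the Drive `checkpoints/` folder so they survive a disconnect.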
### Monitoring

- Watch GPU memory usage with `nvidia-smi`
- Check training progress in output logs
- Monitor Drive space during training (see the snippet after this list)
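Both GPU memory and Drive usage can be checked from inside the notebook; a small monitoring cell, assuming Drive is mounted at the default location:

```python
# GPU memory as seen by the driver
!nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# GPU memory currently held by PyTorch tensors
import torch
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")

# Rough Drive usage of the project folder (slow over the Drive mount)
!du -sh /content/drive/MyDrive/jake-ai
```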
### Out of Memory
- Reduce batch size
- Increase gradient accumulation steps (see the sketch after this list)
- Clear notebook runtime and restart
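The first two bullets are two sides of the same trade: lowering the per-step batch and raising gradient accumulation by the same factor keeps the effective batch size while reducing peak GPU memory. Illustrative values only; the real ones belong in `setup_training_arguments()`:

```python
from transformers import TrainingArguments

# Smaller per-step batch, more accumulation: same effective batch, lower peak memory
training_args = TrainingArguments(
    output_dir="/content/jake-ai-checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```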
### Drive Space
- Clean up old checkpoints (see the snippet after this list)
- Monitor space usage
- Keep at least 10GB free
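From a notebook cell, something like this shows where the space is going; the deletion line is only an example path, so double-check it before uncommenting:

```python
# See how much space each checkpoint takes on Drive
!du -sh /content/drive/MyDrive/jake-ai/checkpoints/* 2>/dev/null

# Example cleanup of one old checkpoint -- verify the path first
# !rm -rf "/content/drive/MyDrive/jake-ai/checkpoints/checkpoint-50"
```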
### Slow Training
- Check GPU allocation (see the check after this list)
- Ensure no other notebooks using GPU
- Close unnecessary browser tabs
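A one-liner to confirm which GPU the runtime actually got (free-tier Colab usually hands out a T4):

```python
import torch
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
```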