- Azure Account (with free $200 credit for new accounts)
- Training data (`jake_training.json`)
- Azure ML Workspace
- Go to Azure Portal (portal.azure.com)
- Create new "Machine Learning" resource
- Select your subscription and resource group
- Choose a workspace name (e.g., "jake-ai-workspace")
- Select your region (choose one with GPU availability)
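If you'd rather script the setup, the workspace can also be created with the Azure ML Python SDK (v2). A minimal sketch; the subscription ID, resource group, and region are placeholders to replace with your own values:

```python
# Minimal sketch: create the workspace with the Azure ML SDK v2 (azure-ai-ml package).
# Subscription ID, resource group, and region are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<your-subscription-id>",
    resource_group_name="<your-resource-group>",
)

workspace = Workspace(
    name="jake-ai-workspace",
    location="eastus2",  # pick a region with GPU availability
)

# begin_create returns a poller; .result() blocks until provisioning finishes
ml_client.workspaces.begin_create(workspace).result()
```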
For Yi-34B, we recommend:
- VM Size: NC24ads_A100_v4 (1x A100 80GB GPU)
  - Handles Yi-34B comfortably, with headroom for training
  - Costs ~$3.60/hour, but training will be much faster
- Alternative: NC6s_v3 (1x V100 16GB GPU)
  - More cost-effective (~$0.90/hour)
  - Requires much heavier optimization (aggressive quantization and offloading); 16GB is very tight for a 34B model
```
jake-ai/
├── src/
│   ├── train.py               # Training script
│   └── utils.py               # Helper functions
├── data/
│   └── training/
│       └── jake_training.json
├── config/
│   └── training_config.yaml
└── README.md
```
Create a new compute instance with these specs:
- Python 3.10
- PyTorch 2.1.0+cu118
- Required packages in requirements.txt
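The compute instance can likewise be provisioned from Python instead of the Studio UI. A rough sketch, assuming the `ml_client` from the workspace step; the instance name `jake-a100-instance` is just an example:

```python
# Sketch: provision the GPU compute instance via the SDK v2.
# Assumes `ml_client` from the workspace step; the instance name is an example.
from azure.ai.ml.entities import ComputeInstance

gpu_instance = ComputeInstance(
    name="jake-a100-instance",
    size="Standard_NC24ads_A100_v4",  # or "Standard_NC6s_v3" for the V100 option
)

ml_client.compute.begin_create_or_update(gpu_instance).result()
```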
- Go to "Data" in Azure ML Studio
- Create new "Data Asset"
- Upload your training data
- Register as "jake-training-data"
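If you prefer the SDK over the Studio UI here too, a sketch of the same registration, assuming `ml_client` from earlier and the local path from the project layout above:

```python
# Sketch: upload and register the training file as the "jake-training-data" asset.
# Assumes `ml_client` from earlier; the path matches the project layout above.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

training_data = Data(
    name="jake-training-data",
    path="data/training/jake_training.json",
    type=AssetTypes.URI_FILE,
)

ml_client.data.create_or_update(training_data)
```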
Use these settings in your YAML config:
```yaml
compute:
  instance_type: "NC24ads_A100_v4"

training:
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 1e-4
  num_epochs: 3

model:
  name: "01-ai/Yi-34B"
  quantization: "4bit"
  lora_r: 64
  lora_alpha: 128
```
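Inside `train.py`, those model settings typically translate into a QLoRA-style setup with `transformers` and `peft`. A condensed sketch of that wiring (not the full training script); the `target_modules` list is an assumption based on Yi's LLaMA-style attention layers:

```python
# Condensed sketch of how train.py can apply the model settings above (QLoRA-style).
# The target_modules list is an assumption for Yi's LLaMA-style attention layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "01-ai/Yi-34B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantization: "4bit"
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,                                 # lora_r
    lora_alpha=128,                       # lora_alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # sanity check: only LoRA weights are trainable
```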
- Use Azure ML Studio's monitoring dashboard
- Track:
  - Training loss
  - GPU utilization
  - Memory usage
  - Training progress
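Azure ML jobs use MLflow tracking, so metrics logged from `train.py` show up in the Studio dashboard automatically. A minimal sketch; the loop and loss values are placeholders standing in for your real training loop:

```python
# Minimal sketch: metrics logged through MLflow appear in the Azure ML Studio dashboard.
# The loop below is a placeholder for your real training loop.
import mlflow

with mlflow.start_run():
    for step, loss in enumerate([2.31, 1.87, 1.52]):  # placeholder loss values
        mlflow.log_metric("train_loss", loss, step=step)
```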
**Estimated Costs**
- A100: ~$3.60/hour
- V100: ~$0.90/hour
- Storage: Minimal (~$0.08/GB/month)
**Cost Optimization**
- Use spot instances when possible (up to 90% cheaper)
- Delete compute instance when not in use
- Clean up unused storage
- Monitor usage with Azure Cost Management
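For the spot-instance savings above, one option is a low-priority GPU cluster that scales to zero between jobs, so you only pay while training runs. A sketch, again assuming the `ml_client` from earlier; the cluster name is an example:

```python
# Sketch: low-priority (spot) GPU cluster that scales to zero when idle.
# Assumes `ml_client` from earlier; the cluster name is an example.
from azure.ai.ml.entities import AmlCompute

spot_cluster = AmlCompute(
    name="jake-a100-spot",
    size="Standard_NC24ads_A100_v4",
    min_instances=0,            # scales to zero between jobs, so no idle charge
    max_instances=1,
    tier="low_priority",        # spot pricing; jobs may be preempted
)

ml_client.compute.begin_create_or_update(spot_cluster).result()
```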
- Store credentials securely using Azure Key Vault
- Use managed identities for authentication
- Set up proper access controls
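For secrets such as a Hugging Face token or a storage key, a hedged sketch of reading from Key Vault with the workspace's managed identity; the vault URL and secret name are placeholders:

```python
# Sketch: read a secret from Azure Key Vault using the managed identity.
# The vault URL and secret name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

secret_client = SecretClient(
    vault_url="https://<your-key-vault>.vault.azure.net",
    credential=DefaultAzureCredential(),  # resolves to the managed identity on Azure
)

hf_token = secret_client.get_secret("huggingface-token").value
```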
- Enable early stopping to prevent wasted compute
- Use checkpointing every 30 minutes
- Monitor training metrics
- Keep logs for troubleshooting
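With the Hugging Face `Trainer`, checkpointing and early stopping come down to a few arguments. A sketch of just those pieces, reusing the batch size, learning rate, and epoch count from the config above; `model`, `train_dataset`, and `eval_dataset` are assumed to be defined elsewhere in `train.py`:

```python
# Sketch: periodic checkpoints plus early stopping with the Hugging Face Trainer.
# `model`, `train_dataset`, and `eval_dataset` are assumed to exist elsewhere in train.py.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,        # batch_size
    gradient_accumulation_steps=8,        # gradient_accumulation_steps
    learning_rate=1e-4,                   # learning_rate
    num_train_epochs=3,                   # num_epochs
    save_strategy="steps",
    save_steps=200,                       # tune so a checkpoint lands roughly every 30 minutes
    evaluation_strategy="steps",
    eval_steps=200,
    load_best_model_at_end=True,          # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
```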
**Out of Memory**
- Reduce batch size
- Increase gradient accumulation
- Check GPU memory usage
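A quick way to check how close you are to the GPU's limit from inside the training script:

```python
# Sketch: print current GPU memory usage from inside the training script.
import torch

if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1e9
    reserved_gb = torch.cuda.memory_reserved() / 1e9
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU memory: {allocated_gb:.1f} GB allocated, "
          f"{reserved_gb:.1f} GB reserved, {total_gb:.1f} GB total")
```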
**Training Issues**
- Check logs in Azure ML Studio
- Monitor GPU utilization
- Verify data loading
**Cost Overruns**
- Set up budget alerts
- Monitor usage regularly
- Use auto-shutdown policies
- Set up Azure ML Workspace
- Configure compute instance
- Upload training data
- Run training script
- Monitor progress and costs
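Putting it together, the training run itself can be submitted as a command job. A final sketch that reuses the compute and data asset names from earlier; the environment reference and the `--config`/`--data` flags for `train.py` are placeholders and assumptions about your script:

```python
# Sketch: submit train.py as an Azure ML command job.
# Compute and data asset names match the earlier steps; the environment reference
# and the --config/--data flags for train.py are placeholders.
from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes

job = command(
    code=".",  # uploads the jake-ai/ project folder
    command=(
        "python src/train.py "
        "--config config/training_config.yaml "
        "--data ${{inputs.training_data}}"
    ),
    inputs={
        "training_data": Input(type=AssetTypes.URI_FILE, path="azureml:jake-training-data:1"),
    },
    environment="azureml:<your-training-environment>:1",  # built from requirements.txt
    compute="jake-a100-instance",
    display_name="jake-yi34b-finetune",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)  # open this link to monitor the run in Azure ML Studio
```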
Need help with any specific step? Let me know!