This guide explains how to run various types of machine learning jobs on our Slurm cluster setup. It provides examples of single-node training, hyperparameter search using job arrays, and distributed training across multiple nodes.
.
├── ml_jobs/
│   ├── scripts/                     # ML training scripts
│   │   ├── train_classifier.py      # Single-node basic classifier
│   │   ├── param_search.py          # Hyperparameter search (array job)
│   │   └── distributed_train.py     # Multi-node distributed training
│   └── submit/                      # Slurm submission scripts
│       ├── submit_train.sh          # Submit single-node job
│       ├── submit_param_search.sh   # Submit array job
│       └── submit_distributed.sh    # Submit distributed job
├── requirements.txt                 # Python dependencies
└── shared/                          # Shared storage for all nodes
    ├── logs/                        # Job output logs
    ├── models/                      # Saved models
    └── results/                     # Training results
- Build the cluster with ML support:
docker-compose up -d --build
- Verify ML environment:
docker exec -it slurmctld bash
python3 -c "import torch; print(f'PyTorch version: {torch.__version__}')"
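Since jobs run on the compute nodes rather than the controller, it is also worth confirming that each node sees the same Python environment. A quick check, assuming two compute nodes as in the distributed example below:
srun --nodes=2 --ntasks-per-node=1 python3 -c "import socket, torch; print(socket.gethostname(), torch.__version__)"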
Single-node training (train_classifier.py) trains a basic classifier on one node.
Features:
- Random Forest classifier on synthetic data
- Model saving and metrics logging
- Progress tracking
Run with:
sbatch ml_jobs/submit/submit_train.sh
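For reference, a minimal submission script for this job might look like the sketch below; the job name, output file, and resource values are illustrative rather than the exact contents of submit_train.sh:
#!/bin/bash
#SBATCH --job-name=train_classifier   # illustrative name
#SBATCH --output=slurm-%j_train.out   # matches the slurm-*_train.out pattern used for monitoring
#SBATCH --cpus-per-task=2             # example resource request
#SBATCH --mem=4G

# Train the classifier on whichever node Slurm allocates.
python3 ml_jobs/scripts/train_classifier.py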
Hyperparameter search (param_search.py) uses Slurm job arrays to evaluate multiple parameter combinations in parallel.
Features:
- Multiple parameter combinations
- Parallel evaluation
- Results aggregation
Run with:
sbatch ml_jobs/submit/submit_param_search.sh
Monitor array tasks:
squeue -r # Show individual array tasks
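A submission script for the array job might look roughly like the sketch below; the array range, resources, and the --task-id flag are assumptions (the script could equally read SLURM_ARRAY_TASK_ID from the environment) rather than the exact contents of submit_param_search.sh:
#!/bin/bash
#SBATCH --job-name=param_search
#SBATCH --array=0-3                   # one task per parameter combination
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --output=slurm-%A_%a.out      # %A = array job ID, %a = task index

# Each task gets its own SLURM_ARRAY_TASK_ID, which param_search.py can use
# to select a single hyperparameter combination from its grid.
# (--task-id is a hypothetical flag used here for illustration.)
python3 ml_jobs/scripts/param_search.py --task-id "$SLURM_ARRAY_TASK_ID"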
Distributed training (distributed_train.py) performs multi-node training with PyTorch DistributedDataParallel (DDP).
Features:
- Distributed data parallel training
- Synchronized model updates
- Per-node progress logging
Run with:
sbatch ml_jobs/submit/submit_distributed.sh
Monitor nodes:
tail -f /shared/logs/nodes/node_*_log.txt # Monitor all nodes
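The distributed submission script has a few more moving parts. The sketch below shows the usual pattern, assuming distributed_train.py derives its rank from SLURM_PROCID and the world size from SLURM_NTASKS; the exact contents of submit_distributed.sh may differ:
#!/bin/bash
#SBATCH --job-name=distributed_train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2

# Use the first allocated node as the rendezvous point for all processes.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500              # any free port; 29500 is a common choice

# srun starts one process per task (here, one per node); each process reads
# SLURM_PROCID / SLURM_NTASKS when initializing the DDP process group.
srun python3 ml_jobs/scripts/distributed_train.py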
The two parallel job types compare as follows.
Array Jobs:
- Purpose: Run the same code with different parameters
- Use Case: Hyperparameter search, cross-validation
- Data: Independent data per task
- Communication: No inter-task communication
- Fault Tolerance: Individual tasks can fail independently
- Submit Script Example:
#SBATCH --array=0-3
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
Distributed Jobs:
- Purpose: Run a single training job across multiple nodes
- Use Case: Large model training, distributed SGD
- Data: Shared/distributed across nodes
- Communication: Continuous inter-node communication
- Fault Tolerance: Entire job fails if any node fails
- Submit Script Example:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
Each job type has different logging patterns:
- Single Node Jobs:
tail -f slurm-*_train.out
- Array Jobs:
# Monitor specific array task
tail -f slurm-*_<array_task_id>.out
- Distributed Jobs:
# Monitor all nodes
tail -f /shared/logs/nodes/node_*_log.txt
# View saved models
ls -l /shared/models/
# Check results
cat /shared/results/distributed_training_results_*.txt
- Node Communication Issues:
# Check network connectivity
srun --nodes=2 hostname
# Verify distributed setup
python3 -c "import torch.distributed as dist; print(dist.is_available())"
- Memory Issues:
# Monitor memory usage
sstat -j <jobid> --format=JobID,MaxRSS,MaxVMSize
# Adjust batch size in code if needed
- Failed Tasks in Array Jobs:
# Rerun failed array tasks
sacct -j <jobid> --format=JobID,State
sbatch --array=<failed_indices> submit_script.sh
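The two commands above can be combined: pull the failed task indices out of sacct and feed them back to sbatch. A sketch, assuming the failed tasks show up as <jobid>_<index> rows in the sacct output:
failed=$(sacct -j <jobid> --noheader --format=JobID,State \
         | awk '$2 == "FAILED" && $1 ~ /_[0-9]+$/ {sub(/.*_/, "", $1); print $1}' \
         | paste -sd, -)
sbatch --array="$failed" submit_script.sh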
- Check available resources:
sinfo -o "%n %c %m" # Show nodes, CPUs, and memory
- Monitor job resource usage:
sstat -j <jobid> --format=JobID,AveCPU,AveRSS,AveVMSize
- Cancel jobs:
scancel <jobid> # Cancel specific job
scancel -u $USER # Cancel all your jobs
This setup provides a comprehensive environment for running various types of ML workloads on a Slurm cluster, from simple single-node training to complex distributed training scenarios.