A streamlined framework for setting up a multi-node, GPU-accelerated, distributed system for PyTorch workloads using Docker Swarm. With ShadowSWARM, you can quickly configure and deploy a scalable environment for machine learning inference or training across multiple machines.
- Automated Docker Swarm initialization and worker node setup.
- Flexible configuration using an interactive CLI (`config.py`).
- Dynamic IP and hostname detection for seamless multi-node deployment.
- Streamlined distributed PyTorch workloads with Fully Sharded Data Parallel (FSDP).
- Integrated Streamlit interface for easy interaction with your system.
- **Docker and NVIDIA Drivers**:
  - Install Docker and the NVIDIA drivers on all machines.
  - Install the NVIDIA Container Toolkit:

    ```bash
    sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker
    ```

  - Verify Docker GPU support (if this check fails, see the runtime note after this list):

    ```bash
    docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi
    ```
- **Python 3.8+**:
  - Install Python on the master machine:

    ```bash
    sudo apt-get install python3 python3-pip
    ```
- **Passwordless SSH**:
  - Configure passwordless SSH from the master to all worker nodes:

    ```bash
    ssh-keygen -t rsa -b 2048
    ssh-copy-id user@worker-ip
    ```

  - You only need to set up SSH from the master node to the workers; the worker nodes do not need SSH access to each other or to the master.
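If the GPU verification command above fails even though the drivers and toolkit are installed, Docker may not yet be configured to use the NVIDIA runtime. The step below is not part of this project's scripts; it is the standard NVIDIA Container Toolkit configuration command:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```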
- **Clone the Repository (Only on Master)**:
  - Clone this repository on the master node:

    ```bash
    git clone https://github.com/DJStompZone/shadowswarm.git
    cd shadowswarm
    ```

  - The worker nodes do not need the repository because Docker Swarm handles the deployment of containers automatically.
- **Build the Docker Image**: Build the Docker image on the master node:

  ```bash
  docker build -t shadowswarm-app .
  ```
- **Run the Configuration Script**: Use the interactive CLI to gather and validate the necessary configuration:

  ```bash
  python3 config.py
  ```

  This script will:
  - Prompt for the master and worker node details.
  - Save the configuration to a `.env` file.
  - Start the `bootstrap.sh` script to initialize Docker Swarm and add workers.
- **Verify Swarm Setup**: Check the Swarm status after the bootstrap (illustrative sample output appears after these steps):

  ```bash
  docker node ls
  ```
- **Deploy the Docker Stack**: Once the Swarm is ready, deploy the application:

  ```bash
  docker stack deploy --compose-file docker-compose.yml shadowswarm
  ```
- Open a browser and navigate to the master node IP: `http://<master-node-ip>:8501`
- Use the Streamlit interface to interact with your distributed PyTorch system.
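For orientation, a healthy three-node deployment looks roughly like the following. The IDs, hostnames, and replica counts are placeholders for illustration, not output captured from this project:

```text
$ docker node ls
ID             HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
abc123def456   master-node   Ready     Active         Leader
ghi789jkl012   worker1       Ready     Active
mno345pqr678   worker2       Ready     Active

$ docker stack services shadowswarm
ID             NAME                  MODE         REPLICAS   IMAGE
...            shadowswarm_master    replicated   1/1        shadowswarm-app
...            shadowswarm_worker1   replicated   1/1        shadowswarm-app
```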
```text
shadowswarm/
├── config.py           # CLI script for gathering configuration
├── bootstrap.sh        # Script for initializing Docker Swarm and adding workers
├── docker-compose.yml  # Docker Swarm stack configuration
├── Dockerfile          # Docker image definition
├── .env                # Environment variables for the deployment
└── app/                # Application directory
    ├── main.py         # PyTorch and Streamlit code
    └── utils.py        # Utility functions
```
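As a purely illustrative aid, a Streamlit front-end over a PyTorch backend typically looks something like the sketch below. The function name `run_inference` and the UI layout are hypothetical and are not taken from `app/main.py`:

```python
# Hypothetical sketch of a Streamlit front-end; the real app/main.py may differ.
import streamlit as st
import torch


def run_inference(prompt: str) -> str:
    # Placeholder for the call that dispatches work to the distributed backend.
    device = "GPU" if torch.cuda.is_available() else "CPU"
    return f"(echo from {device}) {prompt}"


st.title("ShadowSWARM")
prompt = st.text_input("Prompt")
if st.button("Run") and prompt:
    st.write(run_inference(prompt))
```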
- **Configuration**: `config.py` prompts for master and worker node details, saves them to `.env`, and triggers `bootstrap.sh`.
- **Swarm Initialization**: `bootstrap.sh` initializes Docker Swarm on the master node and connects workers via SSH.
- **Stack Deployment**: `docker-compose.yml` orchestrates the master and worker containers, assigning roles using environment variables.
- **Distributed Workload**: The master node manages the distributed PyTorch workload across all nodes using Fully Sharded Data Parallel (FSDP); see the sketch below.
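To make the last point concrete, here is a minimal, illustrative sketch of how a node-level process could consume the environment variables listed below and wrap a model in FSDP. It assumes one process and one GPU per node, uses a placeholder model, and is not the code in `app/main.py`:

```python
# Minimal FSDP sketch (illustrative only); assumes one process and one GPU per node.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # Rendezvous using the environment variables provided by the stack.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{os.environ['MASTER_IP']}:{os.environ['MASTER_PORT']}",
        rank=int(os.environ["NODE_RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
    torch.cuda.set_device(0)  # single GPU per node in this sketch

    model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
    model = FSDP(model)  # shard parameters and gradients across the cluster

    out = model(torch.randn(8, 1024, device="cuda"))
    print(f"rank {dist.get_rank()} produced output of shape {tuple(out.shape)}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```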
| Variable | Description |
|---|---|
| `MASTER_HOSTNAME` | Hostname of the master node. |
| `MASTER_IP` | IP address of the master node. |
| `WORKER_HOSTNAMES` | Comma-separated list of worker hostnames. |
| `NODE_RANK` | Rank of the node in the distributed setup. |
| `WORLD_SIZE` | Total number of nodes in the cluster. |
| `MASTER_PORT` | Port for master-worker communication. |
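As a concrete illustration of the table above, a generated `.env` for one master and two workers might look like the following. All values are placeholders; `config.py` fills in your real hostnames, IPs, and port:

```ini
MASTER_HOSTNAME=master-node
MASTER_IP=192.168.1.10
WORKER_HOSTNAMES=worker1,worker2
NODE_RANK=0
WORLD_SIZE=3
MASTER_PORT=29500
```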
- **Docker Swarm Issues**:
  - Check whether Swarm is initialized:

    ```bash
    docker info
    ```

  - Verify that worker nodes are connected:

    ```bash
    docker node ls
    ```

- **SSH Issues**:
  - Test passwordless SSH from the master:

    ```bash
    ssh <worker-ip>
    ```

- **Container Logs**:
  - Check the logs for the master or workers:

    ```bash
    docker service logs shadowswarm_master
    docker service logs shadowswarm_worker1
    ```

- **GPU Issues**:
  - Ensure GPUs are accessible:

    ```bash
    docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi
    ```
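A further check that is not specific to this project: if `docker service ls` shows a service stuck at `0/1` replicas, the scheduler's error message usually explains why:

```bash
docker service ps shadowswarm_master --no-trunc
```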
- Add a new worker node to the swarm (see the note below for retrieving the join token):

  ```bash
  docker swarm join --token <worker-join-token> <master-ip>:2377
  ```

- Update the `WORKER_HOSTNAMES` in the `.env` file to include the new worker.
- Re-deploy the stack:

  ```bash
  docker stack deploy --compose-file docker-compose.yml shadowswarm
  ```
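The `<worker-join-token>` in the first step is printed when the Swarm is initialized; if you no longer have it, it can be reprinted on the master node with the standard Swarm command:

```bash
docker swarm join-token worker
```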
Contributions are welcome! Please open an issue or submit a pull request if you have problems, suggestions, or improvements.
This project is licensed under the MIT License.