ShadowSWARM



A streamlined framework for setting up a multi-node, GPU-accelerated, distributed system for PyTorch workloads using Docker Swarm. With ShadowSWARM, you can quickly configure and deploy a scalable environment for machine learning inference or training across multiple machines.


Features

  • Automated Docker Swarm initialization and worker node setup.
  • Flexible configuration using interactive CLI (config.py).
  • Dynamic IP and hostname detection for seamless multi-node deployment.
  • Streamlined distributed PyTorch workloads with Fully Sharded Data Parallel (FSDP).
  • Integrated Streamlit interface for easy interaction with your system.

Quickstart Guide

Prerequisites

  1. Docker and NVIDIA Drivers:

    • Install Docker and NVIDIA drivers on all machines.
    • Install the NVIDIA Container Toolkit:
      sudo apt-get install -y nvidia-container-toolkit
      sudo systemctl restart docker
    • Verify Docker GPU support:
      docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi
  2. Python 3.8+:

    • Install Python on the master machine:
      sudo apt-get install python3 python3-pip
  3. Passwordless SSH:

    • Configure passwordless SSH from the master to all worker nodes:
      ssh-keygen -t rsa -b 2048
      ssh-copy-id user@worker-ip
      • You only need to set up SSH from the master node to the workers.
      • The worker nodes do not need SSH access to each other or the master.
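With more than a couple of workers, the ssh-copy-id step above is worth scripting. A minimal sketch of one way to do that (the helper name and the example user/IPs are ours, not part of ShadowSWARM):

```python
# Hypothetical helper, not part of the ShadowSWARM repo: generate the
# ssh-copy-id command for each worker so key distribution can be scripted
# instead of typed once per machine.
def ssh_setup_commands(user, worker_ips):
    """Return one ssh-copy-id command string per worker node."""
    return [f"ssh-copy-id {user}@{ip}" for ip in worker_ips]

if __name__ == "__main__":
    for cmd in ssh_setup_commands("ubuntu", ["10.0.0.2", "10.0.0.3"]):
        print(cmd)  # run each printed command from the master node
```

The commands could also be driven directly via `subprocess.run`, but printing them keeps the key-entry prompts interactive.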

Installation

  1. Clone the Repository (Only on Master):

    • Clone this repository on the master node:
      git clone https://github.com/DJStompZone/shadowswarm.git
      cd shadowswarm
    • The worker nodes do not need the repository because Docker Swarm handles the deployment of containers automatically.
  2. Build the Docker Image: On the master node, build the image:

    docker build -t shadowswarm-app .

Setup and Deployment

  1. Run the Configuration Script: Use the interactive CLI to gather and validate the necessary configuration:

    python3 config.py

    This script will:

    • Prompt for the master and worker node details.
    • Save the configuration to a .env file.
    • Start the bootstrap.sh script to initialize Docker Swarm and add workers.
  2. Verify Swarm Setup: Check the Swarm status after the bootstrap:

    docker node ls
  3. Deploy the Docker Stack: Once the Swarm is ready, deploy the application:

    docker stack deploy --compose-file docker-compose.yml shadowswarm
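The stack file itself lives in the repository; as a rough illustration of the shape such a Swarm stack can take (the service names, port mapping, and placement constraints below are assumptions for illustration, not the project's actual docker-compose.yml):

```yaml
# Illustrative sketch only -- the repository's docker-compose.yml is the
# authoritative stack definition (GPU wiring omitted here for brevity).
version: "3.8"
services:
  master:
    image: shadowswarm-app
    environment:
      NODE_RANK: "0"
    deploy:
      placement:
        constraints:
          - node.role == manager
    ports:
      - "8501:8501"   # Streamlit UI
  worker:
    image: shadowswarm-app
    deploy:
      mode: replicated
      replicas: 2
      placement:
        constraints:
          - node.role == worker
```

Placement constraints are what pin the master service to the Swarm manager while worker replicas spread across the remaining nodes.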

Access the Streamlit App

  1. Open a browser and navigate to the master node IP:

    http://<master-node-ip>:8501
    
  2. Use the Streamlit interface to interact with your distributed PyTorch system.

File Structure

shadowswarm/
├── config.py            # CLI script for gathering configuration
├── bootstrap.sh         # Script for initializing Docker Swarm and adding workers
├── docker-compose.yml   # Docker Swarm stack configuration
├── Dockerfile           # Docker image definition
├── .env                 # Environment variables for the deployment
├── app/                 # Application directory
│   ├── main.py          # PyTorch and Streamlit code
│   └── utils.py         # Utility functions

How It Works

  1. Configuration:

    • config.py prompts for master and worker node details, saves them to .env, and triggers bootstrap.sh.
  2. Swarm Initialization:

    • bootstrap.sh initializes Docker Swarm on the master node and connects workers via SSH.
  3. Stack Deployment:

    • docker-compose.yml orchestrates the master and worker containers, assigning roles using environment variables.
  4. Distributed Workload:

    • The master node manages the distributed PyTorch workload across all nodes using Fully Sharded Data Parallel (FSDP).
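The repository's app/main.py implements the actual workload; as a sketch of the pattern step 4 describes (all names here are illustrative, not the project's code), each rank joins the process group using the deployment's environment variables and wraps its model in FSDP:

```python
import os

def rendezvous_url(master_ip, master_port):
    """TCP rendezvous address every rank connects to (illustrative helper)."""
    return f"tcp://{master_ip}:{master_port}"

def run_rank(model_factory):
    # Illustrative only -- app/main.py is the real entry point. Imports are
    # deferred so the helper above stays importable without torch installed.
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(
        backend="nccl",
        init_method=rendezvous_url(os.environ["MASTER_IP"],
                                   os.environ["MASTER_PORT"]),
        rank=int(os.environ["NODE_RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )
    # FSDP shards the model's parameters across all ranks in the group.
    return FSDP(model_factory().cuda())
```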

Environment Variables

Variable           Description
MASTER_HOSTNAME    Hostname of the master node.
MASTER_IP          IP address of the master node.
WORKER_HOSTNAMES   Comma-separated list of worker hostnames.
NODE_RANK          Rank of the node in the distributed setup.
WORLD_SIZE         Total number of nodes in the cluster.
MASTER_PORT        Port for master-worker communication.
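A container might parse these variables along the following lines (a sketch assuming WORKER_HOSTNAMES is comma-separated as described above; the function name and the fallback defaults are ours, not the project's):

```python
import os

def read_cluster_config(env=os.environ):
    """Parse the deployment's environment variables into typed values.

    Illustrative helper: defaults (rank 0, port 29500, world size inferred
    as master + workers) are assumptions, not ShadowSWARM's behavior.
    """
    workers = [h for h in env.get("WORKER_HOSTNAMES", "").split(",") if h]
    return {
        "master_hostname": env.get("MASTER_HOSTNAME", ""),
        "master_ip": env.get("MASTER_IP", ""),
        "workers": workers,
        "node_rank": int(env.get("NODE_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", str(1 + len(workers)))),
        "master_port": int(env.get("MASTER_PORT", "29500")),
    }
```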

Troubleshooting

  1. Docker Swarm Issues:

    • Check if Swarm is initialized:
      docker info
    • Verify worker nodes are connected:
      docker node ls
  2. SSH Issues:

    • Test passwordless SSH from the master:
      ssh <worker-ip>
  3. Container Logs:

    • Check the logs for the master or workers:
      docker service logs shadowswarm_master
      docker service logs shadowswarm_worker1
  4. GPU Issues:

    • Ensure GPUs are accessible:
      docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi

Scaling

  1. On the new worker node, run the join command (retrieve the token on the master with docker swarm join-token worker):

    docker swarm join --token <worker-join-token> <master-ip>:2377
  2. Update the WORKER_HOSTNAMES in the .env file to include the new worker.

  3. Re-deploy the stack:

    docker stack deploy --compose-file docker-compose.yml shadowswarm
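Step 2's .env edit can of course be done by hand; as a sketch, a small helper (hypothetical, not shipped with the repo) that appends a worker and keeps WORLD_SIZE consistent:

```python
def add_worker(env_text, new_worker):
    """Append a worker to WORKER_HOSTNAMES and bump WORLD_SIZE in .env text.

    Hypothetical convenience helper; editing the file manually works too.
    """
    lines = []
    for line in env_text.splitlines():
        key, sep, value = line.partition("=")
        if key == "WORKER_HOSTNAMES":
            value = f"{value},{new_worker}" if value else new_worker
        elif key == "WORLD_SIZE":
            value = str(int(value) + 1)  # one more node in the cluster
        lines.append(f"{key}={value}" if sep else line)
    return "\n".join(lines)
```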

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have problems, suggestions, or improvements.

License

This project is licensed under the MIT License.
