maxivhuber/nvidia-docker

Deep Learning Container Setup and Usage Guide

This guide provides instructions for setting up and using Podman containers for running deep learning applications with PyTorch and NVIDIA GPUs.

Setup Instructions

  1. Project Folder:
    • Rename your project folder to my_project.
  2. Environment Variables:
    • Open the .env/.argfile file in the root directory.
    • Set your project name as an environment variable (e.g., PROJECT_NAME=my_project).
    • Set the Jupyter Lab port (e.g., JUPYTER_PORT=8000).
    • Configure cluster settings (MASTER_PORT, MASTER_ADDR, WORLD_SIZE, NODE_RANK).
    • Set NCCL environment variables.
  3. Requirements File:
    • Add any necessary pip dependencies to the requirements.txt file.
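As a rough illustration of step 2, a filled-in .env might look like the sketch below. The variable names PROJECT_NAME, JUPYTER_PORT, MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and NODE_RANK come from the list above. All values other than PROJECT_NAME and JUPYTER_PORT are placeholders, and the two NCCL variables are only examples of common NCCL settings; which NCCL variables the repository's .env actually expects is an assumption here.

```shell
# Example .env sketch; values are placeholders, adjust them to your cluster.
PROJECT_NAME=my_project      # must match the project folder name
JUPYTER_PORT=8000            # port on which Jupyter Lab is exposed

# Cluster settings (assumed values for a two-node setup)
MASTER_ADDR=192.168.1.10     # address of the rank-0 node (placeholder)
MASTER_PORT=29500            # free port on the rank-0 node (placeholder)
WORLD_SIZE=2                 # placeholder; use the value your launch setup expects
NODE_RANK=0                  # rank of this node; differs on every node

# NCCL settings (assumed examples; the repository may use others)
NCCL_DEBUG=INFO              # verbose NCCL logging
NCCL_SOCKET_IFNAME=eth0      # network interface NCCL should use (placeholder)
```

Because the file consists of plain KEY=VALUE assignments, it can be read with source .env in a shell.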

Usage

  • Starting the Container:
    • Run bash build.sh to build and start the container using Podman.
  • Accessing Jupyter Lab:
    • Connect to Jupyter Lab through http://<ip-address>:<JUPYTER_PORT>/?token=<token>
  • Direct File Execution:
    • To run a file, such as a Python script, directly from the terminal, use a command like the following:
      • ( source .env && podman exec -w /workspace/my_project $PROJECT_NAME-$NODE_RANK conda run --live-stream -n accelerate accelerate launch my-project.py --arg1 ../path/to/data )
    • This command sources your environment variables from .env in a subshell, then runs the specified file (here a Python script, via accelerate launch in the accelerate conda environment) inside the Podman container named $PROJECT_NAME-$NODE_RANK.
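The direct-execution command above can be wrapped in two small shell functions to avoid retyping it. This is only a sketch: container_name and run_in_container are hypothetical names that do not exist in the repository, and the working directory is assumed to follow the pattern /workspace/<PROJECT_NAME> seen in the example command.

```shell
# Hypothetical helpers; these function names are not part of the repository.

# Build the container name used in the command above: <PROJECT_NAME>-<NODE_RANK>.
container_name() {
    echo "${PROJECT_NAME}-${NODE_RANK}"
}

# Run a script with accelerate inside the container.
# Assumes the project is mounted at /workspace/<PROJECT_NAME>.
# Usage: run_in_container my-project.py --arg1 ../path/to/data
run_in_container() {
    podman exec -w "/workspace/${PROJECT_NAME}" "$(container_name)" \
        conda run --live-stream -n accelerate accelerate launch "$@"
}
```

With these functions defined in the current shell, ( source .env && run_in_container my-project.py --arg1 ../path/to/data ) is intended to behave like the one-line command shown above.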

Synchronization between Nodes

  • Synchronization between Nodes with Optional File Execution:
    • The sync folder contains a script for synchronizing your working directory with remote nodes, essential for training on a cluster.
    • The script supports start and stop actions for synchronizing and managing containers on remote nodes.
    • Additionally, the sync/sync.sh command can take an optional fourth argument specifying a file/path (script or notebook) from the project directory, which will then be executed.
    • Starting Synchronization and Containers:
      • Usage: bash sync/sync.sh <local_absolute_path> <remote_relative_path> start [optional_file_path].
      • For example, to start synchronization and execute a script: bash sync/sync.sh ~/my_project .sync/my_project start /scripts/my-script.py.
    • Stopping Remote Containers:
      • Usage: bash sync/sync.sh <local_absolute_path> <remote_relative_path> stop.
      • For example: bash sync/sync.sh ~/my_project .sync/my_project stop.
    • Configuring Sync Settings:
      • Update the sync/config.json file to include your own nodes, their SSH access details, and keys. Be sure to replace node1, node2, etc., with your actual node details.
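Since sync/sync.sh takes positional arguments in a fixed order, a small pre-flight check can catch mistakes before anything is copied to remote nodes. The function below is a hypothetical sketch and not part of the repository; it only encodes the usage line above (absolute local path, relative remote path, start or stop) and assumes nothing about what sync.sh validates itself.

```shell
# Hypothetical pre-flight check; not part of the repository.
# Usage: check_sync_args <local_absolute_path> <remote_relative_path> <start|stop>
check_sync_args() {
    local local_path="$1" remote_path="$2" action="$3"

    # The local path must be absolute (a leading ~ is expanded by the shell).
    case "$local_path" in
        /*) ;;
        *) echo "local path must be absolute: $local_path" >&2; return 1 ;;
    esac

    # The remote path must be non-empty and relative.
    case "$remote_path" in
        ""|/*) echo "remote path must be relative: $remote_path" >&2; return 1 ;;
    esac

    # Only the two documented actions are accepted.
    case "$action" in
        start|stop) ;;
        *) echo "action must be start or stop: $action" >&2; return 1 ;;
    esac
}
```

One way to use it is check_sync_args ~/my_project .sync/my_project start && bash sync/sync.sh ~/my_project .sync/my_project start, so that sync.sh only runs when the arguments have the documented shape.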
