GitHub - maxivhuber/nvidia-docker

Deep Learning Container Setup and Usage Guide

This guide provides instructions for setting up and using Podman containers for running deep learning applications with PyTorch and NVIDIA GPUs.

Project Folder:
- Rename your project folder to my_project.
Environment Variables:
- Open the .env and argfile.conf files in the root directory.
- Set your project name as an environment variable (e.g., PROJECT_NAME=my_project).
- Set the Jupyter Lab port (e.g., JUPYTER_PORT=8000).
- Configure cluster settings (MASTER_PORT, MASTER_ADDR, WORLD_SIZE, NODE_RANK).
- Set NCCL environment variables.
Requirements File:
- Add any necessary pip dependencies to the requirements.txt file.
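
As an illustration of the environment variables listed above, a .env file might look like the following sketch. Only PROJECT_NAME=my_project and JUPYTER_PORT=8000 come from this guide. The address, port, world size, and NCCL values are placeholder assumptions, and the specific NCCL variables the repository expects may differ.

```shell
# Illustrative .env sketch; every value except the first two is an assumption.
PROJECT_NAME=my_project
JUPYTER_PORT=8000

# Cluster settings. 192.0.2.10 is a documentation placeholder address;
# 29500 is a commonly used port for PyTorch distributed jobs.
MASTER_ADDR=192.0.2.10
MASTER_PORT=29500
WORLD_SIZE=2
NODE_RANK=0

# NCCL settings. These are real NCCL variables, chosen here as examples only.
NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=eth0

# The exec command in this guide addresses the container as $PROJECT_NAME-$NODE_RANK.
echo "container name: ${PROJECT_NAME}-${NODE_RANK}"
```

The final echo line is only there to show how the two variables combine. It would not normally belong in a .env file.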

Starting the Container:
- Run bash build.sh to build and start the container using Podman.
Accessing Jupyter Lab:
- Connect to Jupyter Lab through http://<ip-address>:<JUPYTER_PORT>/?token=<token>
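
The URL pattern above can be assembled from its parts as in the following sketch. The helper name jupyter_url, the host address, and the token value are hypothetical placeholders. Jupyter typically prints the real token at startup, so the container logs are a likely place to look, though that depends on how the container is configured.

```shell
# Hypothetical helper: build the Jupyter Lab URL from host, port, and token.
jupyter_url() {
  echo "http://$1:$2/?token=$3"
}

# Placeholder values; replace them with your node's address and real token.
HOST_IP=192.0.2.10
JUPYTER_PORT=8000
JUPYTER_TOKEN=replace-with-your-token

jupyter_url "$HOST_IP" "$JUPYTER_PORT" "$JUPYTER_TOKEN"
```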
Direct File Execution:
- To execute a file, such as a Python script, directly from the terminal, use a command like the following:
  - ( source .env && podman exec -w /workspace/my_project $PROJECT_NAME-$NODE_RANK conda run --live-stream -n accelerate accelerate launch my-project.py --arg1 ../path/to/data )
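- This command sources your environment variables from .env and executes the specified Python script or Jupyter notebook inside the Podman container named $PROJECT_NAME-$NODE_RANK. The surrounding parentheses run everything in a subshell, so the sourced variables do not leak into your interactive shell.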
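
The direct-execution command above can be wrapped in a small helper so that only the script and its arguments change between runs. This is a minimal sketch: the function name run_in_container and the DRY_RUN switch are hypothetical, and PROJECT_NAME and NODE_RANK are set inline here only to keep the example standalone (normally they come from .env).

```shell
# Normally provided by `source .env`; set here so the sketch runs on its own.
PROJECT_NAME=my_project
NODE_RANK=0

# Hypothetical helper: compose the podman exec command from this guide.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
run_in_container() {
  # Prepend the fixed part of the command to the script and its arguments.
  set -- podman exec -w /workspace/my_project "${PROJECT_NAME}-${NODE_RANK}" \
    conda run --live-stream -n accelerate accelerate launch "$@"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_in_container my-project.py --arg1 ../path/to/data
```

Set DRY_RUN=0 to run the command for real once the printed line looks right.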

Synchronization between Nodes with Optional File Execution:
- The sync folder contains a script for synchronizing your working directory with remote nodes, which is essential for training on a cluster.
- The script supports start and stop actions for synchronizing and managing containers on remote nodes.
- In addition, sync/sync.sh accepts an optional fourth argument: the path of a file (script or notebook) in the project directory, which is then executed.
- Starting Synchronization and Containers:
  - Usage: bash sync/sync.sh <local_absolute_path> <remote_relative_path> start [optional_file_path]
  - For example, to start synchronization and execute a script: bash sync/sync.sh ~/my_project .sync/my_project start /scripts/my-script.py
- Stopping Remote Containers:
  - Usage: bash sync/sync.sh <local_absolute_path> <remote_relative_path> stop
  - For example: bash sync/sync.sh ~/my_project .sync/my_project stop
- Configuring Sync Settings:
  - Update the sync/config.json file to include your own nodes, their respective SSH access details, and keys. Be sure to replace node1, node2, etc., with your actual node details.
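
The start and stop usage described in this section can be sketched as a small argument check that prints the resulting command line. The helper build_sync_command is hypothetical and not part of the repository. It only mirrors the documented usage (an absolute local path, a remote path, the action, and an optional file) and does not contact any node.

```shell
# Hypothetical helper: validate arguments against the documented usage
# and print the sync/sync.sh command line without running it.
build_sync_command() {
  local local_path="$1" remote_path="$2" action="$3" file="${4:-}"

  # The usage line asks for an absolute local path.
  case "$local_path" in
    /*) ;;
    *) echo "error: local path must be absolute: $local_path" >&2; return 1 ;;
  esac

  # Only the two documented actions are accepted.
  case "$action" in
    start|stop) ;;
    *) echo "error: action must be 'start' or 'stop'" >&2; return 1 ;;
  esac

  # Append the optional file argument only when one was given.
  echo "bash sync/sync.sh $local_path $remote_path $action${file:+ $file}"
}

build_sync_command "$HOME/my_project" .sync/my_project start /scripts/my-script.py
build_sync_command "$HOME/my_project" .sync/my_project stop
```

Printing the command first makes it easy to confirm the paths before any remote containers are started or stopped.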
