This repo is a tutorial and workflow setup for the DGX Machine at Örebro University. All you need is an oru.se account. The goal of this tutorial is to get you using the DGX Machine as quickly as possible as your machine of choice for developing GPU-heavy algorithms. For the interested reader there is also a user guide that focuses on deployment rather than development. Note that the user guide uses the term Data Factory to refer to the DGX Machine. In this tutorial we use the term Data Factory only for the server that permanently stores data, which is separate from the DGX Machine where jobs are scheduled and the number crunching happens.
To get an account, you currently have to contact the account manager of the Data Factory and the DGX Machine. As of 14/08/2023, this is Andreas Persson. Send him an email and he will set up an account for you.
Your code on the DGX Machine will run in Docker containers. It is a good idea to install Docker Engine on your local machine and, if your local machine runs Linux, to also perform the post-installation steps. Also create an account on Dockerhub. This allows you to build custom docker images, upload them to Dockerhub, and deploy them on the DGX Machine.
First you will want to build a docker image that has all the necessary dependencies for your project. In the docker directory you can find a Dockerfile that is ready to be deployed on the DGX Machine (tested on 25/11/22). You can take this Dockerfile and adapt it to your needs. For instance, if you want to install additional Python packages, simply modify the requirements.txt file, or use the image as your base image.
Here are the general steps to follow:
- Fire up a terminal and type in the following commands:
  git clone https://github.com/pedrozudo/oru-dgx.git
  cd oru-dgx/docker
- Adapt the build.sh and the push.sh files by setting the variables.
- Adapt the Dockerfile file and the requirements.txt file. (Optional: you can choose to install tmux in your docker image; the reason and instructions are described in the section Using tmux.)
- Build the image:
  ./build.sh
- Push the image to Dockerhub:
  ./push.sh
Note that you do not want to put your own code base in your docker image; we will take care of this later on. Simply consider the docker image as the operating system you want to run on the DGX Machine.
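The contents of build.sh and push.sh are not reproduced in this tutorial. As a rough sketch of what such scripts usually wrap, assuming they boil down to docker build and docker push, the following composes an image reference from three variables. The names DOCKER_USER, IMAGE_NAME and IMAGE_TAG and the function names are hypothetical and may differ from the variables in the actual scripts.

```shell
# Sketch only: variable and function names are assumptions, not taken from the repo's scripts.
DOCKER_USER="${DOCKER_USER:-yourdockerhubuser}"   # hypothetical: your Dockerhub account name
IMAGE_NAME="${IMAGE_NAME:-oru-dgx}"               # hypothetical: repository name on Dockerhub
IMAGE_TAG="${IMAGE_TAG:-latest}"                  # hypothetical: tag for this build

# Compose the full image reference in the form user/name:tag.
image_ref() {
    printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Build the image from the Dockerfile in the current directory.
build_image() {
    docker build -t "$(image_ref "$DOCKER_USER" "$IMAGE_NAME" "$IMAGE_TAG")" .
}

# Upload the image to Dockerhub (run docker login once beforehand).
push_image() {
    docker push "$(image_ref "$DOCKER_USER" "$IMAGE_NAME" "$IMAGE_TAG")"
}
```

Calling build_image and then push_image from the docker directory would publish the image under the reference printed by image_ref, which is the string you later enter when selecting the image for a job.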
Next you will want to mount your home directory on the Data Factory onto your local machine. First, create a directory on your local machine to serve as the mount point.
mkdir -p ~/mount/datafactory
Now you can use the following command to mount the remote directory onto your local machine (sshfs can be installed using sudo apt install sshfs). Note that you need to be on the oru.se network for this (either physically or via VPN).
sshfs username@10.1.115.65:/mnt/dgx_001/aiqu_data/users/username/ ~/mount/datafactory
In the command above, replace username with your oru.se username. The password is your oru.se password. You can verify whether the mount was successful by running mount | grep sshfs, which should produce output similar to:
username@10.1.115.65:/mnt/dgx_001/aiqu_data/users/username/ on /mount/datafactory type fuse.sshfs (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
You can now use the mounted directory to transfer your prototype code and data to the Data Factory from your local machine.
Note: if you have never logged in to the Data Factory, your personal directory does not exist yet. In this case you will first have to log in using the web interface. The steps are described in the next section.
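If you mount and unmount often, the commands above can be wrapped in a few shell functions. This is a minimal sketch: the function names and the DF_MOUNT variable are illustrative, the host and remote path are the ones from the sshfs command above, and on systems that ship only FUSE 3 the unmount command may be called fusermount3 instead.

```shell
# Sketch only: function names and DF_MOUNT are illustrative, not part of the repo.
DF_HOST="10.1.115.65"                              # Data Factory address from the sshfs command
DF_MOUNT="${DF_MOUNT:-$HOME/mount/datafactory}"    # local mount point

# Build the remote specification for a given oru.se username.
df_remote() {
    printf '%s@%s:/mnt/dgx_001/aiqu_data/users/%s/\n' "$1" "$DF_HOST" "$1"
}

# Mount the remote home directory unless something is already mounted there.
df_mount() {
    local user="$1"
    mkdir -p "$DF_MOUNT"
    if mountpoint -q "$DF_MOUNT"; then
        echo "already mounted: $DF_MOUNT"
    else
        sshfs "$(df_remote "$user")" "$DF_MOUNT"
    fi
}

# Unmount again when you are done.
df_unmount() {
    fusermount -u "$DF_MOUNT"
}
```

With these defined, df_mount followed by your oru.se username mounts the directory, and df_unmount releases it.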
Let's assume you have some code and data in the ~/mount/datafactory/AGI directory and you now want to run your algorithm on the DGX Machine. Here is what you have to do.
- Go to horizon.oru.se, look for oru.aiqu.se and log in. You will end up in a dashboard.
- Click on the Jobs tab on the left and populate the following fields:
  - Job Label: just enter the name you want your job to have
  - Image: enter the docker image (on Dockerhub) you want to use, for instance, pedrozudo/oru-dgx:torch-2.0.1
  Also adjust the number of GPUs you need and for how long (in minutes) you want your job to be running. You can also expose ports (see the user guide for further details on this).
  Next, click on Advanced Settings and mount the /Home Catalog/AGI directory.
  You are now good to go. Click on Queue Job and wait for your job to be scheduled.
  Once this happens, open a terminal for your job (on the far right in the job list). List the files and directories (ls). Your project should now be in the AGI directory. You can cd into it and run your algorithm:
  cd AGI
  python agi.py
- You will probably have a bug or two in your code which you would like to fix. On your local machine, go to the ~/mount/datafactory/AGI/ directory and fix your bug. The cool thing is that both the ~/mount/datafactory/AGI/ directory on your local machine and the /AGI/ directory on the DGX Machine are mounted from the same directory on the Data Factory. This means that a change to a file on your local machine is reflected within the running docker container on the DGX Machine. You can now code away on the DGX Machine while using the comfort of your local setup.
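Before starting a long run it can help to confirm, from the job terminal, that the project directory really was mounted and that the GPUs are visible. The following is a minimal sketch of such a check; preflight is a hypothetical helper name and is not part of the repo.

```shell
# Sketch only: preflight is a hypothetical helper, not part of the repo.

# Check that the project directory exists and list the visible GPUs if possible.
# Returns 0 if the directory exists, 1 otherwise.
preflight() {
    local project_dir="$1"
    if [ ! -d "$project_dir" ]; then
        echo "missing: $project_dir (was it mounted under Advanced Settings?)"
        return 1
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L    # list the GPUs visible inside the container
    else
        echo "nvidia-smi not found; GPU check skipped"
    fi
    echo "ok: $project_dir"
}
```

In the job terminal you could then run preflight AGI before cd AGI and python agi.py. Piping the run through tee (for instance python agi.py 2>&1 | tee run.log) additionally writes the output to a file in the mounted directory, so it can be read from your local machine as well.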
The horizon.oru.se interface allows opening a terminal for your job, but if that terminal is closed (accidentally, on purpose, or due to logout) the logs are lost and you can't "reconnect" to the previously open terminal. A workaround is to run your job inside a tmux session; even if the terminal is closed, the tmux session lives on and you can attach to it again. Head over to the official tmux documentation to learn more.
A neat tmux feature (it is, after all, a terminal multiplexer) is that you can stack several terminals and switch between them through keyboard shortcuts, thus letting you have multiple terminals open at the same time. In tmux parlance, each terminal is a window, and a session consists of one or more windows.
This section is a primer on how to get started with tmux. If there is any difference between what's here and the official documentation, the latter is correct (in which case please consider opening a PR with the correction).
Installation via apt is the recommended way:
sudo apt update && sudo apt install tmux
In your Dockerfile, if you are already installing packages through apt, simply add tmux to the list of packages and it will get installed in your docker image.
A good video to start with is Fireship: Tmux in 100 seconds.
Say that right when your docker image is spun up as a container on the DGX Machine (i.e. right when your job starts), you want a tmux session named mysession to start with three windows: 1) htop, 2) watch -n1 nvidia-smi, 3) a regular terminal for you to do what you like. (It is possible to add more windows to an existing session.)
Place the following command at the end of your Dockerfile (make sure it is not overridden elsewhere):
CMD tmux new-session -d -s mysession \; \
send-keys 'htop' C-m \; \
new-window -t mysession:1 \; \
rename-window 'nvidia-smi' \; \
send-keys 'watch -n1 nvidia-smi' C-m \; \
new-window -t mysession:2 \; \
send-keys '/bin/bash' C-m \; \
attach-session -t mysession
On horizon.oru.se, open a terminal from your active job and type tmux ls. This will list mysession as a session that is already running, which you can attach to with tmux a. To reiterate, the whole point of using tmux is that even if the terminal gets closed, the tmux session does not die, so you can simply attach to the running session from a new terminal.
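If you are not sure whether the session is still running, the attach step can be made to fall back to creating a fresh session. The following is a minimal sketch; attach_or_create is a hypothetical helper name, while has-session, attach-session and new-session are standard tmux subcommands.

```shell
# Sketch only: attach_or_create is a hypothetical helper name.

# Attach to the named session if it exists, otherwise create it.
attach_or_create() {
    local name="${1:-mysession}"
    if tmux has-session -t "$name" 2>/dev/null; then
        tmux attach-session -t "$name"
    else
        tmux new-session -s "$name"
    fi
}
```

Running attach_or_create mysession from a new terminal then either reconnects you or starts over with an empty session. tmux also has a built-in shortcut for the same behavior: tmux new-session -A -s mysession.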
All keyboard shortcuts for interacting with a tmux session start with Ctrl + B, followed by the keybinding for the specific operation. The three main keybindings to remember are (Ctrl + B comes before every one):
- W: Brings up the list of windows in the session. Use the arrow keys to reach a window and hit Enter to select it.
- 0-9: Directly brings up the window with the matching index number, without going through the list of windows first.
- [: Activates scroll mode (called copy mode in the tmux documentation) inside a window. Use the up and down arrow keys to scroll, and Page Up/Down to jump a page. Hit q to deactivate scroll mode.
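The same operations are also available as tmux commands, which is handy in scripts or when you want to keep a window's output. The sketch below uses hypothetical helper names; select-window and capture-pane are standard tmux subcommands.

```shell
# Sketch only: helper names are hypothetical; the tmux subcommands are standard.

# Make the window with the given index the current one in the session
# (the command-line counterpart of Ctrl + B followed by a digit).
goto_window() {
    tmux select-window -t "$1:$2"
}

# Write the visible contents of a window, plus up to N lines of its
# scrollback history, to a file.
save_scrollback() {
    local session="$1" window="$2" lines="$3" outfile="$4"
    tmux capture-pane -p -S "-$lines" -t "$session:$window" > "$outfile"
}
```

For example, save_scrollback mysession 1 500 nvidia.log would store the current screen of window 1 together with up to 500 lines of history in nvidia.log.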