In this tutorial we present three open source projects that form a core set of utilities commonly installed at High Performance Computing (HPC) centers.
An overview of the containers in the cluster:
If you haven't already installed and tested the required packages, please refer to the requirements page.
You will need to clone the tutorial repo and then run the helper script. The initial clone of the repo may take 5-10 minutes. The first time you run the helper script, it downloads all the containers from Docker Hub. This can take quite a long time depending on your network speed, as the images total approximately 13GB. Once the containers are downloaded, they are started and the services are launched. As a point of reference, on a recent test from a home fiber optic network with the client connected over Wi-Fi, the download and container startup process took 12 minutes.
NOTE: For Windows, if you haven't already done so, you will need to configure git not to convert line endings into Windows format. Run this command using the git-bash shell application before cloning the tutorial repo:
$ git config --global core.autocrlf input
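To confirm the setting took effect before cloning, a small check like the following may help. This is only a sketch: the function name `autocrlf_matches_tutorial` is hypothetical, and it simply tests for the `input` value recommended above.

```shell
# Hypothetical helper: succeeds only when the given value is "input",
# the setting this tutorial recommends for core.autocrlf.
autocrlf_matches_tutorial() {
  [ "$1" = "input" ]
}

# Usage (in git-bash, before cloning):
#   if autocrlf_matches_tutorial "$(git config --get core.autocrlf)"; then
#     echo "core.autocrlf is set to input"
#   else
#     echo "run: git config --global core.autocrlf input"
#   fi
```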
$ git clone https://github.com/ubccr/hpc-toolset-tutorial.git
$ cd hpc-toolset-tutorial
$ ./hpcts start
Fetching latest HPC Toolset Images..
[+] Running 12/12
- base Pulled 5.2s
- ondemand Pulled 5.3s
- cpn01 Pulled 5.3s
- cpn02 Pulled 5.1s
- mongodb Pulled 5.2s
- xdmod Pulled 5.1s
- ldap Pulled 5.2s
- mysql Pulled 5.2s
- coldfront Pulled 5.2s
- frontend Pulled 5.2s
- slurmdbd Pulled 5.1s
- slurmctld Pulled 5.2s
Starting HPC Toolset Cluster..
[+] Running 23/23
- Network hpc-toolset-tutorial_compute Created 0.1s
- Volume "hpc-toolset-tutorial_etc_slurm" Created 0.0s
- Volume "hpc-toolset-tutorial_cpn02_slurmd_state" Created 0.0s
- Volume "hpc-toolset-tutorial_slurmdbd_state" Created 0.0s
- Volume "hpc-toolset-tutorial_slurmctld_state" Created 0.0s
- Volume "hpc-toolset-tutorial_data_db" Created 0.0s
- Volume "hpc-toolset-tutorial_home" Created 0.0s
- Volume "hpc-toolset-tutorial_var_lib_mysql" Created 0.0s
- Volume "hpc-toolset-tutorial_srv_www" Created 0.0s
- Volume "hpc-toolset-tutorial_cpn01_slurmd_state" Created 0.0s
- Volume "hpc-toolset-tutorial_etc_munge" Created 0.0s
- Container mongodb Started 12.0s
- Container mysql Started 11.9s
- Container ldap Started 11.8s
- Container hpc-toolset-tutorial-base-1 Started 12.3s
- Container slurmdbd Started 13.2s
- Container slurmctld Started 13.0s
- Container frontend Started 15.2s
- Container cpn02 Started 14.2s
- Container cpn01 Started 15.2s
- Container ondemand Started 15.2s
- Container coldfront Started 15.7s
- Container xdmod Started 15.5s
Coldfront URL: https://localhost:2443
OnDemand URL: https://localhost:3443
XDMoD URL: https://localhost:4443
Login to frontend: ssh -p 6222 hpcadmin@localhost
NOTE: Even after this output with the URLs appears, the processes in these containers may not be fully running yet. Depending on the speed of your computer, starting up the processes may take a few minutes (or even up to 10 minutes). If the websites are not yet displaying, use the docker-compose logs command shown below to check the container logs.
NOTE: Windows users should get several pop-up messages from Docker Desktop during this process asking to allow local system access to the Docker containers. Please click the "Share it" button in each pop-up.
If you have notifications blocked, you may not see these pop-ups and the authorization will eventually time out. If this happens, you will get this type of error message:
Error response from daemon: user declined directory sharing C:\Users\path_to_my_folder
To fix this, open Docker Desktop, navigate to Settings > Resources, and click on File Sharing. Then add the directory where you've cloned the HPC Toolset Tutorial and click "Apply & Restart".
Re-run:
$ ./hpcts start
If this doesn't work, please run:
$ ./hpcts destroy
$ ./hpcts start
Once the helper script finishes, you can check the status of the containers:
$ docker-compose logs -f
mysql | 200620 4:03:42 [Note] Event Scheduler: Loaded 0 events
mysql | 200620 4:03:42 [Note] mysqld: ready for connections.
frontend | ---> Starting the MUNGE Authentication service (munged) ...
frontend | ---> Starting sshd on the frontend...
cpn01 | slurmd: Munge credential signature plugin loaded
cpn01 | slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=15575 TmpDisk=229951 Uptime=43696 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
cpn02 | slurmd: debug: AcctGatherEnergy NONE plugin loaded
coldfront | -- Waiting for database to become active ...
coldfront | -- Initializing coldfront database...
ondemand | ---> Starting ondemand httpd24...
slurmdbd | slurmdbd: debug2: DBD_NODE_STATE_UP: NODE:cpn01 REASON:(null) TIME:1592625828
slurmctld | slurmctld: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
xdmod | 2020-06-21 19:23:48 [notice] xdmod-ingestor end (process_end_time: 2020-06-21 19:23:48)
xdmod | ---> Starting XDMoD...
Please see our troubleshooting section for more info.
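Because the services can take several minutes to come up after the helper script returns, a polling loop is one way to wait for the three web portals listed in the startup output. The sketch below is not part of the helper script. The function name, retry count, and delay are hypothetical choices, and the `-k` flag assumes the portals use self-signed certificates. Any HTTP response counts as "up", so this only confirms that the web server accepts connections, not that the application is fully initialized.

```shell
# Poll one URL until the server answers or the attempts run out.
# -k skips certificate verification (assumes self-signed certs),
# -s silences progress output, -o /dev/null discards the page body.
wait_for_url() {
  url=$1
  attempts=${2:-60}   # hypothetical default: 60 tries
  delay=${3:-10}      # hypothetical default: 10 seconds between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -k -s -o /dev/null --max-time 5 "$url"; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out: $url" >&2
  return 1
}

# Usage:
#   for u in https://localhost:2443 https://localhost:3443 https://localhost:4443; do
#     wait_for_url "$u" || break
#   done
```

With the defaults, each URL is tried for roughly 10 minutes, which matches the upper bound mentioned in the note above.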
If errors are showing up in the logs or the services have not all started, check to see which images have been downloaded and which containers are running. This is what you should see:
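One hedged way to automate the container part of this check is to compare the names of the running containers against those shown in the startup output above. The function name `missing_containers` and the `EXPECTED` list are illustrative. The `base` container is left out on the assumption that it exits once startup completes.

```shell
# Container names taken from the startup output above. The "base"
# container is omitted on the assumption that it does not keep running.
EXPECTED="mongodb mysql ldap slurmdbd slurmctld frontend cpn01 cpn02 ondemand coldfront xdmod"

# Print each expected container that is not currently running.
missing_containers() {
  running=$(docker container list --format '{{.Names}}')
  for name in $EXPECTED; do
    # grep -x matches the whole line, -q suppresses output
    printf '%s\n' "$running" | grep -qx "$name" || echo "$name"
  done
}

# Usage: prints nothing when every expected container is running.
#   missing_containers
```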
If not, run the 'destroy' option of the helper script to shut everything down and remove all volumes. Then start everything back up again:
$ ./hpcts destroy
$ docker container list
(Should show no containers)
$ docker volume list
(Should show no volumes)
If either command still lists anything, run the corresponding remove command:
$ docker container rm [ContainerID]
$ docker volume rm [VolumeName]
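If several volumes are left over, typing each name by hand gets tedious. The sketch below prints the remove commands for leftover tutorial volumes without running them, so they can be reviewed first. The function name is hypothetical, and the `hpc-toolset-tutorial_` prefix is taken from the volume names in the startup output above.

```shell
# Print (but do not run) a remove command for each leftover volume
# whose name contains the tutorial prefix.
print_volume_cleanup() {
  docker volume list -q --filter name=hpc-toolset-tutorial_ |
    while read -r vol; do
      echo "docker volume rm $vol"
    done
}

# Usage: review the output, then run the printed commands if they look right.
#   print_volume_cleanup
```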
Then start it all up again:
$ ./hpcts start
Since you have already downloaded all the images, this command only starts up the containers and services, which takes just a few minutes.
To completely start over and re-download all images, run the cleanup option of the helper script and then start everything again:
$ ./hpcts cleanup
$ ./hpcts start
NOTE: The cleanup script removes ALL containers, images and volumes except the mongo and mariadb images. If you're getting database errors we recommend you remove these manually with these docker commands:
$ docker image list
$ docker image rm [IMAGE IDs for mongo and mariadb images]
$ ./hpcts start
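To pick out the relevant image IDs from a long listing, a filter along these lines may be useful. It is a sketch only: the function name `db_images` is hypothetical, and it matches any repository name containing "mongo" or "mariadb", so check the output before removing anything.

```shell
# Print "ID REPOSITORY:TAG" for images whose repository name contains
# "mongo" or "mariadb". Matching is by substring on the repository part.
db_images() {
  docker image list --format '{{.ID}} {{.Repository}}:{{.Tag}}' |
    grep -E ' [^ ]*(mongo|mariadb)[^ ]*:' || true
}

# Usage: list the candidates, then remove each one by ID.
#   db_images
#   docker image rm [IMAGE ID]
```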