Spartan Job Monitoring

rNLKJA edited this page Mar 14, 2023 · 3 revisions

When submitting jobs to the batch system, users need to specify how many CPUs, GPUs, and how much RAM their job requires.

Requesting resources, however, does not guarantee that a job actually uses them. Users have often asked the Spartan admins for help monitoring the resource usage of their jobs, so we have developed a system that lets users monitor their own running jobs.

This is in addition to the simple aggregate summary that you can receive by including the email options in a job submission script:

#SBATCH --mail-user=myemail@example.edu.au
#SBATCH --mail-type=ALL

How does the system work?

Every 15 seconds, CPU usage, GPU usage and other relevant metrics are collected from each Spartan worker node and forwarded to a central database. A command line tool queries these stored metrics, analyzes them and displays the results in an easy-to-understand format.

Using the job monitoring system

The my-job-stats tool can be used to monitor a job, given its job ID.

  • Log into Spartan via SSH
  • Run my-job-stats -j jobID -a, with the jobID replaced by an actual job number.

You can see the possible options by running my-job-stats -h
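
As a rough sketch of how the steps above could be scripted, the shell function below loops over your running jobs and calls my-job-stats on each one. The function name stats_for_running_jobs is hypothetical. It assumes squeue and my-job-stats are both on your PATH (as on a Spartan login node) and uses only the -j and -a options shown above. Job array tasks are listed with IDs such as 123_4, and whether my-job-stats accepts that form is not covered here.

```shell
#!/bin/bash
# Hypothetical helper: print my-job-stats output for each of your running jobs.
# Assumes squeue and my-job-stats are on PATH (e.g. on a Spartan login node).
stats_for_running_jobs() {
    local user jobid
    user="${USER:-$(id -un)}"
    # -h drops the header, -t RUNNING keeps running jobs only, %i prints the job ID.
    for jobid in $(squeue -u "$user" -h -t RUNNING -o %i); do
        echo "=== Job ${jobid} ==="
        my-job-stats -j "$jobid" -a
    done
}
```

After sourcing the file that defines it, run stats_for_running_jobs from the login node.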

Real time monitoring

Once your job is running, there are two ways to connect to it from the login node: you can SSH to the node it is running on, or you can connect to the job itself.

SSH to job

Using squeue, you can find out which node your job is running on:

squeue -j [jobID] -o %i,%N
JOBID,NODELIST
28715643,spartan-bm083

The above example shows the job is running on spartan-bm083, so you can SSH to that node from the login node while your job is running.

ssh spartan-bm083

When you ssh to the node, if you have multiple jobs on the same node, your SSH session is randomly put into the container of one of your jobs running on the node.
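
The lookup and the SSH step can be combined. The following is a minimal sketch, not an official Spartan tool: the function names job_nodes and ssh_to_job are hypothetical, and it assumes a single-node job and that it is run from the login node. It refuses to guess when squeue reports no node (for example, a pending or finished job) or a multi-node list.

```shell
#!/bin/bash
# Hypothetical helpers: look up a job's node with squeue, then SSH to it.

# Print the node list for a job ID (empty if the job has no node assigned).
job_nodes() {
    # -h drops the header line, %N prints the node list only.
    squeue -j "$1" -h -o %N
}

# SSH to the node of a single-node job. Run this from the login node.
ssh_to_job() {
    local node
    node=$(job_nodes "$1")
    if [ -z "$node" ]; then
        echo "Job $1 has no node assigned (pending or finished?)" >&2
        return 1
    fi
    case "$node" in
        *,* | *\[*)
            # Multi-node jobs are listed like spartan-bm[083-084].
            echo "Job $1 spans several nodes ($node); pick one manually" >&2
            return 1
            ;;
    esac
    ssh "$node"
}
```

For example, ssh_to_job 28715643 would open a session on spartan-bm083 for the job shown above.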

Use srun to connect to job

Using srun, you can connect directly to the session of a job running on the worker node. Using the job ID, from the login node, you can do:

srun --interactive --jobid 28715643 --pty /bin/bash

which gives you a bash terminal inside your job. Alternatively, you can run a command directly, e.g.

srun --interactive --jobid 28715643 --pty nvidia-smi
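
To watch a job over time, the same srun call can be repeated in a loop. The sketch below is one possible approach, using only the srun options shown above. The function name sample_in_job and its argument order are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: run a command inside a running job several times.
# Usage: sample_in_job JOBID COUNT INTERVAL_SECONDS COMMAND [ARGS...]
sample_in_job() {
    local jobid=$1 count=$2 interval=$3 i=1
    shift 3
    while [ "$i" -le "$count" ]; do
        # Timestamp each sample so successive outputs can be told apart.
        date '+--- %H:%M:%S ---'
        srun --interactive --jobid "$jobid" --pty "$@"
        # Pause between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, sample_in_job 28715643 5 15 nvidia-smi would print five nvidia-smi snapshots, 15 seconds apart, from inside the job.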

The job is running on