
HPC: Monitoring Progress & Aborting #5584

Open
ax3l opened this issue Jan 21, 2025 · 1 comment
Labels
machine / system Machine or system-specific issue

Comments

ax3l commented Jan 21, 2025

In unsupervised runs (or: runs that ideally should be unsupervised, but whose status I end up checking every other minute), such as batched HPC execution, the operator should not have to monitor a running job synchronously.

Under real-world load on HPC systems (changing load on networks and filesystems, OoM scenarios, changing software, etc.), it is not uncommon that a batched job starts outside regular working hours and, in some cases, hangs until walltime. This can be costly.

We should establish a mechanism (in job scripts) that programmatically monitors the progress / health of a simulation and, if a configurable timeout is reached, aborts it: first with SIGTERM (for backtrace generation) and then with SIGKILL.

Possible Implementations

A very simple implementation could be to write some kind of status (e.g., the current time) into a file (e.g., from the I/O processor) every time step. In the batch job, a single polling process could check the time difference.
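
For illustration, a minimal sketch of the watcher side, assuming the I/O processor writes the current epoch time into a (hypothetical) status.txt every time step:

# status.txt is hypothetical: the I/O rank would overwrite it with the
# current epoch time (equivalent of `date +%s`) once per time step
last_step_time=$(cat status.txt)
now_time=$(date +%s)
if [[ $((now_time - last_step_time)) -ge 300 ]]
then
    echo "No progress reported for 5 minutes - simulation may be hanging."
fi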

File-based I/O is of course far from ideal, e.g., due to sync, load, short time steps, or for I/O-free runs (e.g., optimization). Better might be to have a port open for health queries (which could later be reused to query things like memory usage per MPI process, load, etc.) or to react to a POSIX signal and print something on a specific channel (e.g., stderr), like dd does.
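
For the signal route, the operator-side usage could look like the following sketch; it assumes (hypothetically, not implemented today) that the simulation installs a SIGUSR1 handler that prints its current step to stderr, similar to dd:

# ask all tasks of the job for a status line via SIGUSR1 (dd-style);
# assumes the application installed such a handler (hypothetical)
scancel --signal=USR1 ${SLURM_JOBID}
# the status line would then show up on the job's stderr (e.g., the Slurm
# log file); a watcher can check whether it keeps advancing over time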

ax3l added the machine / system Machine or system-specific issue label Jan 21, 2025
ax3l commented Feb 18, 2025

A simple way to monitor locally from the job script would be to use the mtime (i.e., ls -l) of our output.txt, which is periodically updated with the progress status printed on stdout.

Logic would be:

  • srun the job in the background (appending &)
  • check ls -l output.txt or stat -c %y output.txt with an N-minute sleep interval; if the file's mtime is older than the sleep interval, stop the job (as in the script below)
# ...

# start the simulation in the background so the script can keep monitoring
srun ... > output_${SLURM_JOBID}.txt &
srun_pid=$!

timeout_sec=300  # timeout: 5min
while true
do
    sleep ${timeout_sec}

    # stop monitoring once the job step has finished on its own
    if ! kill -0 ${srun_pid} 2> /dev/null
    then
        break
    fi

    # seconds since the output file was last modified
    file_mtime=$(stat -c %Y output_${SLURM_JOBID}.txt)
    now_time=$(date +%s)
    diff_sec=$((now_time - file_mtime))

    if [[ ${diff_sec} -ge ${timeout_sec} ]]
    then
        echo "Job did not progress for ${timeout_sec} seconds..."
        echo "Probably hanging... Will terminate now."
        kill -9 ${srun_pid}
        scancel ${SLURM_JOBID}
        break
    fi
done
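
To follow the SIGTERM-before-SIGKILL sequence from the issue description, the termination step in the loop above could be extended along these lines (a sketch; the 30-second grace period is an arbitrary choice):

        # ask for a graceful shutdown first, e.g., for backtrace generation ...
        kill -TERM ${srun_pid}
        sleep 30
        # ... then force-kill whatever is left and release the allocation
        kill -KILL ${srun_pid} 2> /dev/null
        scancel ${SLURM_JOBID}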
