sbatch submission failures do not continue #1

Open
wants to merge 1 commit into
base: master

trickytank

This PR prevents sbatch submission failures from breaking later steps.

I sometimes get the following error from sbatch:

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

This leaves JOBID as an empty string, which later causes an error in sacct. The failure never resolves, because the status script assumes the job is still running.
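To illustrate the failure mode described above, here is a minimal sketch of a guard that refuses to pass on an empty or non-numeric job ID. The helper name `extract_job_id` is hypothetical and not part of the existing scripts; it assumes sbatch's default success output of the form "Submitted batch job <id>", with the ID in the fourth space-separated field, as the `cut -f4 -d' '` call elsewhere in this thread does.

```shell
# Hypothetical helper (not in the original scripts): pull the job ID out of
# sbatch's stdout. Prints the ID and returns 0 only when field 4 is numeric.
extract_job_id() {
  local id
  id=$(printf '%s\n' "$1" | cut -f4 -d' ')
  case "$id" in
    ''|*[!0-9]*) return 1 ;;      # empty or non-numeric: treat as a failed submission
    *) printf '%s' "$id" ;;
  esac
}
```

A caller could then test the return status instead of handing an empty string to sacct.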

This PR fixes the problem by waiting until the job has been submitted successfully. It waits 10 seconds between submission attempts, as submission failures appear to cluster in time.
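The approach described above could look roughly like the following. This is a sketch under assumptions, not the actual diff in this PR: the function name `submit_until_ok` and the `SUBMIT_DELAY` variable are made up for illustration, and the loop has no upper bound on attempts.

```shell
# Sketch only: keep resubmitting until sbatch succeeds and yields a job ID.
# SUBMIT_DELAY is a hypothetical knob; the PR description uses a 10 second wait.
SUBMIT_DELAY="${SUBMIT_DELAY:-10}"

submit_until_ok() {
  local out jobid
  while true; do
    # On success sbatch prints "Submitted batch job <id>"; the ID is field 4.
    if out=$(sbatch "$@"); then
      jobid=$(printf '%s\n' "$out" | cut -f4 -d' ')
      if [ -n "$jobid" ]; then
        printf '%s' "$jobid"
        return 0
      fi
    fi
    >&2 echo "WARN: sbatch submission failed; retrying in ${SUBMIT_DELAY}s"
    sleep "$SUBMIT_DELAY"
  done
}
```

It would be called as `JOBID=$(submit_until_ok ${DEP_STRING} ${SBATCH_ARGS} "$@")`. Because nothing limits the number of attempts, a permanently failing sbatch would make this loop forever.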

@nathanhaigh
Owner

Thanks for the contribution!

I've had similar sporadic failures with the sacct command used in the status script. I dealt with them using a "retry" function. A similar function in the submit script could be used like this:

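# Run the given command, retrying up to $max times with $delay seconds between attempts.
function retry {
  local n=1
  local max=5
  local delay=5
  while true; do
    # Stop as soon as the command succeeds; otherwise warn and retry, or give up.
    "$@" && break || {
      if [[ $n -lt $max ]]; then
        >&2 echo "WARN: Command ($*) failed on attempt $n/$max."
        sleep $delay
      else
        >&2 echo "ERROR: Command ($*) failed after $n attempts."
        exit 1
      fi
      ((n++))
    }
  done
}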

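# With pipefail, a failed retry makes the whole pipeline (and the assignment) fail.
set -o pipefail
# Exit non-zero if sbatch never succeeds, rather than echoing an empty job ID.
JOBID=$(retry sbatch ${DEP_STRING} ${SBATCH_ARGS} "$@" | cut -f4 -d' ') || exit 1
echo -n "${JOBID}"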

This also has the advantage of not getting stuck in an infinite loop, as it gives up after 5 failed attempts.

What do you think?

@nathanhaigh
Owner

See now: https://github.com/UofABioinformaticsHub/snakemake-tutorial/blob/master/profiles/slurm/status

@trickytank
Author

It's much nicer to have a generic retry function. For my purposes I'd set the max to a large value, as there have been periods of ~15 minutes during which submission failed.
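One way to accommodate a larger limit without editing the function is to read the attempt count and delay from the environment. The sketch below is a variant of the retry function above, not the version in the linked status script; the variable names `RETRY_MAX` and `RETRY_DELAY` are hypothetical, and their defaults mirror the values of 5 used earlier in this thread.

```shell
# Hypothetical environment knobs (not in the original scripts).
RETRY_MAX="${RETRY_MAX:-5}"      # maximum number of attempts
RETRY_DELAY="${RETRY_DELAY:-5}"  # seconds to sleep between attempts

retry() {
  local n=1
  # Re-run the command until it succeeds or the attempt limit is reached.
  until "$@"; do
    if [ "$n" -ge "$RETRY_MAX" ]; then
      >&2 echo "ERROR: Command ($*) failed after $n attempts."
      return 1
    fi
    >&2 echo "WARN: Command ($*) failed on attempt $n/$RETRY_MAX."
    sleep "$RETRY_DELAY"
    n=$((n + 1))
  done
}
```

For example, running the submit script with `RETRY_MAX=180 RETRY_DELAY=10` would keep retrying for roughly 30 minutes (179 sleeps of 10 seconds), which is comfortably longer than a ~15 minute outage while still terminating eventually.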
