Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vasp GPU (single node) podman #4

Open
wants to merge 34 commits into
base: main
Choose a base branch
from
Open

Conversation

tonykew
Copy link

@tonykew tonykew commented Sep 19, 2024

Instructions to build VASP with Nvidia GPU support in podman
and a sample Slurm script to run the podman tarball image.

Tony

Includes build & run information

Note: "wget" targets will need to be updated once this is checked into
main at https://github.com/ubccr/ccr-examples

Tony
Made build script executable
Added copying the VASP tarball to ${SLURMTMPDIR}
Make fix_nvhpc_.pc_files.bash script executable
32 process parallel builds requires 32 cores in the salloc command.

Tony
Update salloc for parallel build
Add output with the path to the sample Slurm script
Tony
Includes build & run information

Note: "wget" targets will need to be updated once this is checked into
main at https://github.com/ubccr/ccr-examples

Tony
${LOGNAME} may not be set, or may be changed, whereas $(id -un) will
always return the username

Tony
KevinVanSlyke added a commit to KevinVanSlyke/ccr-examples that referenced this pull request Sep 25, 2024
commit 60d479f
Merge: ba1abe3 91e6dec
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 15:49:34 2024 -0400

    Merge branch 'main' into VASP-podman

commit ba1abe3
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 15:48:05 2024 -0400

    Use "id -un" for the username rather than the LOGNAME variable

    ${LOGNAME} may not be set, or may be changed, whereas $(id -un) will
    always return the username

    Tony

commit 91e6dec
Merge: 8617fb9 4d60a5f
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 09:54:27 2024 -0400

    Merge pull request #11 from tonykew/VASP-podman

    Vasp podman

commit 4d60a5f
Merge: 45ab638 8617fb9
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 09:52:16 2024 -0400

    Merge branch 'main' into VASP-podman

commit 45ab638
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:30:59 2024 -0400

    Initial check in of VASP-GPU-podman example

    Includes build & run information

    Note: "wget" targets will need to be updated once this is checked into
    main at https://github.com/ubccr/ccr-examples

    Tony

commit 8617fb9
Merge: 8f7a600 9a31f57
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:44:05 2024 -0400

    Merge pull request #10 from tonykew/VASP-podman

    Fix (another!) typo

commit 9a31f57
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:42:58 2024 -0400

    Fix (another!) typo

    Tony

commit 8f7a600
Merge: 1c0f4d8 0add716
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:42:19 2024 -0400

    Merge pull request #9 from tonykew/VASP-podman

    Fix typo

commit 0add716
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:41:30 2024 -0400

    Fix typo

    Tony

commit 1c0f4d8
Merge: e7ee61c 1abbe2d
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:31:07 2024 -0400

    Merge pull request #8 from tonykew/VASP-podman

    Add output with the path to the sample Slurm script

commit 1abbe2d
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:29:51 2024 -0400

    Add output with the path to the sample Slurm script

    Tony

commit e7ee61c
Merge: 30ac9f2 3f69eb9
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:15:39 2024 -0400

    Merge pull request #7 from tonykew/VASP-podman

    Fix typos

commit 3f69eb9
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 02:14:54 2024 -0400

    Fix typos

    Tony

commit 30ac9f2
Merge: 726c196 9b9df0c
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 01:24:15 2024 -0400

    Merge pull request #6 from tonykew/VASP-podman

    Update salloc for parallel build

commit 9b9df0c
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Thu Sep 19 01:21:33 2024 -0400

    Update salloc for parallel build

    32 process parallel builds requires 32 cores in the salloc command.

    Tony

commit 726c196
Merge: b61401a b03025f
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 19:17:21 2024 -0400

    Merge pull request #5 from tonykew/VASP-podman

    Make fix_nvhpc_.pc_files.bash script executable

commit b03025f
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 19:15:51 2024 -0400

    Make fix_nvhpc_.pc_files.bash script executable

    Tony

commit b61401a
Merge: ec5024c b1a8270
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:56:57 2024 -0400

    Merge pull request ubccr#4 from tonykew/VASP-podman

    Added copying the VASP tarball to ${SLURMTMPDIR}

commit b1a8270
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:55:15 2024 -0400

    Added copying the VASP tarball to ${SLURMTMPDIR}

    Tony

commit ec5024c
Merge: 5fdda10 68f63ad
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:51:34 2024 -0400

    Merge pull request ubccr#3 from tonykew/VASP-podman

    Made build script executable

commit 68f63ad
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:50:43 2024 -0400

    Made build script executable

    Tony

commit 5fdda10
Merge: 22f6ee0 1380cbd
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:48:34 2024 -0400

    Merge pull request ubccr#2 from tonykew/VASP-podman

    Fixed wget URLs

commit 1380cbd
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:45:38 2024 -0400

    Fixed wget URLs

    Tony

commit 22f6ee0
Merge: 8c5caec a4bcf15
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:41:53 2024 -0400

    Merge pull request ubccr#1 from tonykew/VASP-podman

    Vasp podman

commit a4bcf15
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:37:26 2024 -0400

    Fixed typos

    Tony

commit 40c8ddd
Author: Tony Kew <tony.d.r.kew@gmail.com>
Date:   Wed Sep 18 18:30:59 2024 -0400

    Initial check in of VASP-GPU-podman example

    Includes build & run information

    Note: "wget" targets will need to be updated once this is checked into
    main at https://github.com/ubccr/ccr-examples

    Tony
@aebruno aebruno requested a review from KevinVanSlyke October 16, 2024 15:22
@aebruno aebruno requested a review from ANekhai October 16, 2024 17:50
Copy link
Collaborator

@ANekhai ANekhai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are sort of my initial comments from validating the build pipeline. My meta question is do we need three example run files? I think having a containerfile, build file, and slurm file should be more than plenty for users. I personally don't think we should support any other kinds of runs with scripts, if they want to run VASP GPU locally with a shell I say we leave this up to them. Please correct me if I'm misunderstanding the scope of the run scripts.

data_dir="/projects/academic/tonykew/tonykew/VASP/data"
##############################################################################

echo "-------------------------------------------------------------------------------"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment kinda pertains to this file. I have concerns that there is just too much going on for a user. As a user, if I were looking for a slurm script to try, I would probably keep looking until I found an extremely simple example which I didn't have to read closely to understand.

My concern is that this script has far too many echo and conditional statements to check for directory structure and print out job info. I would rather an example of "hey this is how you can print out a ton of job info in a slurm script" example live somewhere else, uncoupled to trying to run VASP.

I don't think we should check to see if they have all the correct files when running VASP, just to have one less thing that will fail quietly. I'd rather VASP throw it's errors, rather than us check for them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the changes I've made have resolved this?

fi

# run vasp_std in the data directory
vasp_std
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer keeping this line, and maybe a one-line comment about directory and file structure so a user can easily identify which lines are running and which lines are failing. Otherwise, we're going to get help desk tickets to try and troubleshoot these scripts, which defeats the purpose of having an example repository.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only sections the user needs to change (beyond the SLURM directives)
are clearly marked now.
I think these changes resolve this?

Text and variable name for the OCI image targall directory changed
Sample directory paths changed to follow the CCR standard path naming
 conventions
Added a check for the podman binary - bail if missing
Tidied some logic for the sample data directory path
Renamed the sample SLURM script to sample_SLURM_vasp_std.bash

Tony
Copy link
Contributor

@dsajdak dsajdak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some formatting comments and some questions

slurm/2_ApplicationSpecific/VASP-GPU-podman/README.md Outdated Show resolved Hide resolved
slurm/2_ApplicationSpecific/VASP-GPU-podman/README.md Outdated Show resolved Hide resolved
slurm/2_ApplicationSpecific/VASP-GPU-podman/README.md Outdated Show resolved Hide resolved
slurm/2_ApplicationSpecific/VASP-GPU-podman/README.md Outdated Show resolved Hide resolved
slurm/2_ApplicationSpecific/VASP-GPU-podman/README.md Outdated Show resolved Hide resolved
then
default_acct="$(sacctmgr -rnp show User "$(id -un)" | awk -F'|' '{print $2}')"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what this section and the next are intended for? If the Slurm script exists then an account is included in it and a data path should be set in it.

Copy link
Author

@tonykew tonykew Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are referring to this section:

if [ -f sample_SLURM_vasp_std.bash ]
then
  default_acct="$(sacctmgr -rnp show User "$(id -un)" | awk -F'|' '{print $2}')"
  sed -E -i -e "/#SBATCH[[:space:]]+--account=/s|\".*\"|\"${default_acct}\"|" \
   -e "/#SBATCH[[:space:]]+--chdir=/s|\".*\"|\"${base_dir}\"|" \
   -e "s|^bin_dir=.*|bin_dir=\"${bin_dir}\"|" \
   "sample_SLURM_vasp_std.bash"
  # try to provide a reasonable data dir path
  grp_group="$(groups | sed 's/ /\n/g' | grep ^grp- | head -1 | sed 's/^grp-//')"
  if [ "${grp_group}" != "" ]
  then
    if [ -d "/projects/academic/${grp_group}" ]
    then
      base_dir="/projects/academic/${grp_group}/$(id -un)/VASP"
      sed -E -i -e "s|^data_dir=.*|data_dir=\"${base_dir}/data\"|" \
       "sample_SLURM_vasp_std.bash"
    fi
  fi
  cp "sample_SLURM_vasp_std.bash" "${bin_dir}/sample_SLURM_vasp_std.bash"
else
  echo "sample script \"sample_SLURM_vasp_std.bash\" missing?" >&2
fi

This substitutes the users installed OCI directory in bin directory path into
the SLURM sample script, set the default SLURM account to the user's
group & sets sample paths based on the user's username & group, and copies
this sample script to the user's script dir

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, exactly. Thanks for the explanation. I personally think this is too much for the examples repo. I think the simpler the better, less to break and less to explain to users. However, others may have a different opinion.

Removed references to the individual back end scripts
Trimmed down the sample SLURM script

Tony
Replaced "ccrgroup" with "[YourGroupName]"
Replaced "ccruser" with "[CCRusername]"

Tony
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants