Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix non bfb issues with latent heat in p3 on PM-GPU #2998

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

tcclevenger
Copy link
Contributor

This PR...

  • reverts 9438a98 (changed P3 to use constants for latent_heat variables instead of allocating 2d views during runtime)
  • reimplements the main goal of those changes (no view allocs during runtime) but leaves these variables as views (now in the workspace manager (monolithic) or mem buffer (small kernels)).

I don't know why the previous version was non-BFB, but investigating may take some time and PM-GPU is needed for PR testing, so I suggest merging this PR. I've added "TODO" statements to track that eventually these should just be constants and can create an issue once this is merged (if others agree). The only downside is we keep the 3 temp views.

Testing

I ran the following

./cime/scripts/create_test e3sm_scream_v1 e3sm_scream_v1_long --machine pm-gpu --compiler gnugpu -c -b master -t latent_heat_pr
./cime/scripts/create_test e3sm_scream_v1_medres --machine pm-cpu --compiler=gnu -c -b master -t latent_heat_pr

and passed all baselines for CPU and GPU.

@tcclevenger tcclevenger added p3 regarding p3 microphysics bugfix labels Sep 16, 2024
@tcclevenger tcclevenger self-assigned this Sep 16, 2024
Copy link
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this temporary fix, but I would encourage to chase down the bug. Honestly, I'm not sure that the issue is with view-vs-scalar itself. I'm more concerned about some deeper bugs, such as scratch views actually aliasing each other, or stuff like that. Maybe you could keep the extra views in the buffers as well as the fcn signatures, but use C::latice and C::latvap in the whole code, and see if you are bfb.

Being BFB with what we had earlier is not necessarily a good thing.

@tcclevenger
Copy link
Contributor Author

I've narrowed down to the offending usage of latent_heat_fusion. I'm waiting to merge this in case the fix is simple and does not require updating baselines.

@@ -78,9 +76,9 @@ ::update_prognostic_ice(
qv.set(context, qv + (-qv2qi_vapdep_tend+qi2qv_sublim_tend-qv2qi_nucleat_tend)*dt);

constexpr Scalar INV_CP = C::INV_CP;
th_atm.set(context, th_atm + inv_exner * ((qv2qi_vapdep_tend - qi2qv_sublim_tend + qv2qi_nucleat_tend) * (latvap+latice) * INV_CP +
(qr2qi_collect_tend + qc2qi_collect_tend + qc2qi_hetero_freeze_tend + qr2qi_immers_freeze_tend -
qi2qr_melt_tend + qc2qi_berg_tend) * latice * INV_CP) * dt);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This usage of the constant C::LatIce is what is causing non-bfb to latent_heat_fusion. Investigating.

Copy link
Contributor Author

@tcclevenger tcclevenger Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: LatIce is exactly BFB with latent_heat_fusion for the entire run. Tested with:

  for (int p=0; p<Spack::n; ++p) {
    uint64_t latice_bits = *reinterpret_cast<const uint64_t*>(&latice);
    uint64_t fusion_bits = *reinterpret_cast<const uint64_t*>(&latent_heat_fusion[p]);
    if (latice_bits != fusion_bits)
      printf("latice:%.17f (%" PRIu64 "):"
            "latent_heat_fusion:%.17f (%" PRIu64 ")\n",
             latice,latice_bits,latent_heat_fusion[p],fusion_bits);
  }

directly after the call to th_atm.set().

Not really sure what could be happening here.

Copy link
Contributor

@bartgol bartgol Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said on weaver we are bfb right? If you haven't already, you could try to run a SMS_Ln1 CIME case, forcing mac_aero_mic::subcycling=1, save a bunch of output fields; do this for both master and master+PR, and see which ones actually diff.

@E3SM-Bot
Copy link
Collaborator

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING (click to expand)

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6060
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS p3;bugfix
PULLREQUESTNUM 2998
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA fd86a15
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 25120ff
TEST_REPO_ALIAS SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: tcclevenger/fix_non_bfb_issues_with_latent_heat_in_p3
  • SHA: fd86a15
  • Mode: TEST_REPO

Pull Request Author: tcclevenger

@E3SM-Bot
Copy link
Collaborator

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED (click to expand)

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6060
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS p3;bugfix
PULLREQUESTNUM 2998
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA fd86a15
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 25120ff
TEST_REPO_ALIAS SCREAM
SCREAM_PullRequest_Autotester_Weaver # 6060 FAILED (click to see last 100 lines of console output)

Warning: Permanently added the ECDSA host key for IP address '140.82.113.3' to the list of known hosts.
Submodule 'extern/Catch2' (git@github.com:E3SM-Project/Catch2) registered for path 'externals/ekat/extern/Catch2'
Submodule 'extern/kokkos' (git@github.com:E3SM-Project/kokkos) registered for path 'externals/ekat/extern/kokkos'
Submodule 'extern/spdlog' (git@github.com:gabime/spdlog.git) registered for path 'externals/ekat/extern/spdlog'
Submodule 'extern/yaml-cpp' (git@github.com:SNLComputation/yaml-cpp.git) registered for path 'externals/ekat/extern/yaml-cpp'
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/Catch2'...
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/kokkos'...
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/spdlog'...
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp'...
Warning: Permanently added the ECDSA host key for IP address '140.82.114.4' to the list of known hosts.
ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:SNLComputation/yaml-cpp.git' into submodule path '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp' failed
Failed to clone 'extern/yaml-cpp'. Retry scheduled
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp'...
ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:SNLComputation/yaml-cpp.git' into submodule path '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp' failed
Failed to clone 'extern/yaml-cpp' a second time, aborting
Failed to recurse into submodule path 'externals/ekat'

at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2846)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:2185)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl$7.lambda$execute$0(CliGitAPIImpl.java:1573)
at Jenkins v2.462.1//com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at Jenkins v2.462.1//com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
at Jenkins v2.462.1//com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at Jenkins v2.462.1//com.google.common.util.concurrent.DirectExecutorService.execute(DirectExecutorService.java:51)
at java.base/java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:184)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.submitRemainingCommand(GitCommandsExecutor.java:77)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.invokeAll(GitCommandsExecutor.java:70)

Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to weaver
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1826)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
at hudson.remoting.Channel.call(Channel.java:1042)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:153)
at jdk.internal.reflect.GeneratedMethodAccessor105.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:138)
at PluginClassLoader for git-client/jdk.proxy30/jdk.proxy30.$Proxy100.execute(Unknown Source)
at PluginClassLoader for git//hudson.plugins.git.extensions.impl.SubmoduleOption.onCheckoutCompleted(SubmoduleOption.java:196)
at PluginClassLoader for git//hudson.plugins.git.GitSCM.checkout(GitSCM.java:1388)
at hudson.scm.SCM.checkout(SCM.java:540)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1247)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:649)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:85)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:521)
at hudson.model.Run.execute(Run.java:1894)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
at hudson.model.ResourceController.execute(ResourceController.java:101)
at hudson.model.Executor.run(Executor.java:446)
Caused: hudson.plugins.git.GitException
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.checkResult(GitCommandsExecutor.java:89)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.invokeAll(GitCommandsExecutor.java:69)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.invokeAll(GitCommandsExecutor.java:47)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl$7.execute(CliGitAPIImpl.java:1576)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$GitCommandMasterToSlaveCallable.call(RemoteGitImpl.java:170)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$GitCommandMasterToSlaveCallable.call(RemoteGitImpl.java:161)
at hudson.remoting.UserRequest.perform(UserRequest.java:211)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:377)
at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused: java.io.IOException: Could not perform submodule update
at PluginClassLoader for git//hudson.plugins.git.extensions.impl.SubmoduleOption.onCheckoutCompleted(SubmoduleOption.java:201)
at PluginClassLoader for git//hudson.plugins.git.GitSCM.checkout(GitSCM.java:1388)
at hudson.scm.SCM.checkout(SCM.java:540)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1247)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:649)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:85)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:521)
at hudson.model.Run.execute(Run.java:1894)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
at hudson.model.ResourceController.execute(ResourceController.java:101)
at hudson.model.Executor.run(Executor.java:446)
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins8739115007280394007.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: lbertag@sandia.gov
Finished: FAILURE

@AaronDonahue
Copy link
Contributor

@tcclevenger did you add WIP to this PR because the AT wasn't working? If so, I think we can unWIP it and add the RETEST label.

@tcclevenger
Copy link
Contributor Author

@tcclevenger did you add WIP to this PR because the AT wasn't working? If so, I think we can unWIP it and add the RETEST label.

No, I put this WIP since we don't want to merge it, I was just using it to track the issue. I could close it, but I didn't in case it turns out we actually need it. But I don't think that will be the case.

@bartgol bartgol removed the AT: WIP label Nov 1, 2024
@bartgol bartgol force-pushed the master branch 2 times, most recently from 7114e81 to 579850c Compare November 8, 2024 23:44
@bartgol bartgol force-pushed the master branch 3 times, most recently from 6318f41 to db5b35d Compare November 13, 2024 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix p3 regarding p3 microphysics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants