GPU memory requirements for large-particle jobs after AMReX update #769

Closed
cemitch99 opened this issue Dec 12, 2024 · 3 comments
Labels: bug, bug: affects latest release, component: third party

Comments

@cemitch99
Member

After updating the AMReX dependencies (as described in PR #760), a 1B-particle ImpactX job can no longer run because it exceeds the available GPU memory. To illustrate this, input and output are included below (excluding particle-based diagnostic output to save space). Three cases are included: 1) running the job using ImpactX before updating the AMReX dependencies, 2) running the job after updating the dependencies (as-is), and 3) running the job after updating the dependencies and manually setting amrex.the_arena_init_size. Case 1) runs successfully, 2) fails immediately due to a memory error, and 3) starts successfully but eventually fails.

Test1BParticles.zip

@ax3l added the bug, bug: affects latest release, and component: third party labels on Jan 3, 2025
@ax3l
Member

ax3l commented Jan 3, 2025

@WeiqunZhang and @RemiLehe identified this as a regression caused by AMReX-Codes/amrex#4175

A current work-around is to manually set

amrex.the_arena_init_size=40000000000

with GPU-aware MPI.
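
As a minimal sketch of how this work-around might be applied: AMReX runtime parameters can generally be placed in the inputs file or appended to the command line after the inputs file, and `amrex.the_arena_init_size` is given in bytes. The executable name `impactx` and the inputs file name `input.in` below are placeholders, not taken from this issue, and the helper functions are hypothetical.

```python
import shlex


def arena_init_size_param(gigabytes: int) -> str:
    """Format the AMReX parameter for an initial arena size.

    The size is converted from decimal gigabytes (10**9 bytes) to bytes.
    """
    nbytes = gigabytes * 10**9
    return f"amrex.the_arena_init_size={nbytes}"


def build_command(executable: str, inputs_file: str, gigabytes: int) -> list:
    """Build an argument list with the parameter appended after the inputs file.

    The executable and inputs file names are placeholders (assumptions).
    """
    return [executable, inputs_file, arena_init_size_param(gigabytes)]


def bytes_to_gib(nbytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes) for comparison with device memory."""
    return nbytes / 2**30


# 40 decimal GB reproduces the value quoted in this thread.
cmd = build_command("impactx", "input.in", 40)
print(shlex.join(cmd))
print(f"{bytes_to_gib(40 * 10**9):.2f} GiB")
```

The value 40000000000 bytes corresponds to roughly 37.25 GiB, so it only makes sense on a device with at least that much free memory; a different size may be needed on other GPUs.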

In addition to the memory overhead this regression introduced, the runtime that @cemitch99 observes with this work-around is significantly higher than before.

@atmyers
Contributor

atmyers commented Jan 8, 2025

I believe this PR will fix this in AMReX: AMReX-Codes/amrex#4286

atmyers added a commit to AMReX-Codes/amrex that referenced this issue Jan 9, 2025
For mesh data it is better to use a small, separate Arena for
communication. But, we found that for particle communication this
approach uses too much memory (see the ImpactX Issue here:
BLAST-ImpactX/impactx#769)

This restores the old default behavior (prior to PR #4175), but allows
users to opt-in to using The_Comms_Arena if desired.

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [ ] include documentation in the code and/or rst files, if appropriate
@ax3l
Member

ax3l commented Jan 10, 2025

Fixed via #791 - will be part of ImpactX 25.02.

Thanks for your help, @atmyers! 🙏

@ax3l ax3l closed this as completed Jan 10, 2025