Omp offload #165
base: develop
Merge branch 'omp_offload' of https://github.com/SOLLVE/AutoDock-GPU into omp_offload
@vlkale @mathialakan Thank you. I look forward to testing it out and working on it.
@atillack Please let us know which input test sets you are using; we are curious about them.
@vlkale I think you may be referring to the E50 plots, which are described in J. Chem. Theory Comput. 2021, 17, 2, 1060–1073. See for example #139 (comment).
Thank you for this, and sorry that I did not follow up and reply here sooner. @mathialakan and I looked at it and it is helpful to us. We wanted a representative and reasonable input data set that we can use for further tests of this fork. I have spoken to @mathialakan, and he may have a few more things to add here.
This pull request contains the following proposed contributions to AutoDock-GPU:
(1) OpenMP parallelization of the AutoDock-GPU application's work on a GPU, in place of the CUDA parallelization, together with experiments on Summit using LLVM 14's OpenMP implementation.
(2) OpenMP distribution of the AutoDock-GPU application's work across multiple GPUs of a node by (i) having multiple CPU threads launch target regions on the GPU specified through the device clause of the target region (chosen either at compile time or at runtime) and (ii) using a task-to-device scheduling library that lets a thread dynamically select, at runtime, the device to run on based on the state of the GPUs (we focus on occupancy and load). We note that the task-to-GPU scheduling strategies are based on those being developed in the SOLLVE project for a variety of applications.
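To illustrate approach (i), here is a minimal sketch of several CPU threads each launching a target region on a device chosen through the device clause, with a simple round-robin assignment. It is not taken from this pull request's code: the function names (`available_devices`, `run_job_on_device`, `run_jobs_round_robin`) and the scaling loop that stands in for a docking job are hypothetical. When no offload device is present, or the code is built without OpenMP, the loops simply run on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: number of offload devices (0 when built without OpenMP).
inline int available_devices() {
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

// Hypothetical stand-in for one docking job: scale an array in place
// inside a target region bound to device `dev` via the device clause.
inline void run_job_on_device(std::vector<float>& data, float factor, int dev) {
    assert(dev >= 0);
    float* p = data.data();
    const std::size_t n = data.size();
    #pragma omp target teams distribute parallel for device(dev) map(tofrom: p[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        p[i] *= factor;
    }
}

// One CPU thread per device; job j is sent to device j % slots (round-robin).
inline void run_jobs_round_robin(std::vector<std::vector<float>>& jobs, float factor) {
    const int ndev = available_devices();
    const int slots = ndev > 0 ? ndev : 1;  // fall back to a single host "slot"
    const int njobs = static_cast<int>(jobs.size());
    #pragma omp parallel for num_threads(slots) schedule(static, 1)
    for (int j = 0; j < njobs; ++j) {
        run_job_on_device(jobs[j], factor, j % slots);
    }
}
```

A dynamic strategy as in (ii) would replace the `j % slots` expression with a call into a scheduling library that inspects device state; that library's interface is not described here, so it is left out of the sketch.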
On Summit with 3 GPUs: Our original OpenMP version of AutoDock-GPU, using round-robin scheduling, generated correct results when compared to the CUDA version. Through performance optimizations over the last 3 weeks and a task-to-GPU scheduling strategy that picks a random GPU, we have made the OpenMP version 11x faster than our original OpenMP version. The OpenMP version is still 4x slower than the CUDA version. Using nvprof on both versions, we see that most (~99%) of the application's execution time is spent in kernel3's 'gpu_perform_LS_kernel(float*, float*)'. We think our OpenMP version is slower primarily because of a manually optimized simd reduction that the CUDA version uses. We are currently having trouble translating it efficiently to the OpenMP version (we just use the reduction clause on the target directive). We are looking into invoking the CUDA code for the reduction from within the OpenMP target region, if that is possible. We are also considering invoking all of the CUDA code from within the OpenMP target regions to see what performance that gives, among other optimizations to the OpenMP version.
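For readers unfamiliar with the directive-based form mentioned above, the following is a minimal sketch of a sum reduction expressed with a reduction clause on a combined target construct. It is an illustration only: the function name `sum_on_device` and the plain float sum are assumptions, and it does not reproduce either the hand-optimized CUDA reduction or the actual code in 'gpu_perform_LS_kernel'.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: sum an array inside a target region.
// The reduction clause leaves the combining strategy to the OpenMP
// compiler and runtime instead of hand-written warp or block code.
inline float sum_on_device(const std::vector<float>& values) {
    const float* p = values.data();
    const std::size_t n = values.size();
    float total = 0.0f;
    #pragma omp target teams distribute parallel for \
        reduction(+ : total) map(to: p[0:n]) map(tofrom: total)
    for (std::size_t i = 0; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```

Because the compiler chooses how partial sums are combined across teams and threads, the generated code can differ substantially from a reduction tuned by hand for a specific GPU, which is one plausible source of the kind of gap described above.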
Nevertheless, we believe that these changes hold significant promise for (a) the use of OpenMP in AutoDock-GPU and for (b) using multi-device scheduling for load balancing in particular through the use of OpenMP and its tasking capabilities.
Consider this an initial pull request to inform you of our development. We expect to push more updates to this pull request in the coming few weeks.
We have been in contact with @atillack and Stefano Forli from Scripps over the last couple of months. Please reach out to Mathialakan (@mathialakan) or me (@vlkale) with questions about the code.