GPU trsm performance drops with large block sizes #30
Comments
This is normal: when N=NB (the extreme case), you have absolutely no parallelism. In general, you want NB in the region of 2k on P100 GPUs, but if you are running a very small problem, the best NB may be smaller (a balance between kernel efficiency and algorithm parallelism).
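To make the parallelism argument concrete, here is a rough back-of-the-envelope task count for a tiled left-side TRSM; the matrix size and block sizes in this snippet are assumed for illustration only and are not taken from the report.

```c
#include <stdio.h>

/* Back-of-the-envelope only: with nt = n / nb tile rows, a tiled left-side
 * TRSM performs nt panel solves (a serial chain) and nt*(nt-1)/2 trailing
 * GEMM updates per tile column of the right-hand side.  As nb approaches n,
 * the GEMM updates (the part that can run concurrently on the GPU)
 * disappear entirely. */
int main(void)
{
    const long n = 32768;                        /* assumed problem size */
    const long nbs[] = { 512, 1024, 2048, 8192, 32768 };

    for (int i = 0; i < 5; i++) {
        long nb = nbs[i];
        long nt = n / nb;                        /* tiles along the triangular side */
        long trsm_tasks = nt;                    /* panel solves (serial chain)   */
        long gemm_tasks = nt * (nt - 1) / 2;     /* independent trailing updates  */
        printf("nb=%6ld  trsm tasks=%4ld  gemm tasks=%8ld\n",
               nb, trsm_tasks, gemm_tasks);
    }
    return 0;
}
```

With nb == n the count degenerates to a single TRSM task and zero GEMM updates, which matches the single serial CPU TRSM described in the comment below.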
This is normal behavior.
Original comment by Mikael Simberg (Bitbucket: [Mikael Simberg](https://bitbucket.org/Mikael Simberg)). Thanks for the response. Since I don’t know much about how dplasma works internally, could you explain why you say there is no parallelism when NB = N?
The TRSM algorithm in DPLASMA is written in a blocked version. We first do the TRSM on the leftmost column (TRSM on CPU), then apply the update with GEMM on the remainder of the matrix (GEMM on GPU). With NB=N, the only call is a single serial TRSM on the CPU. It would probably make sense for us to revisit this and execute the TRSMs on the GPU as well (in this case it would degenerate into running a single CUBLAS TRSM for N=NB). This was designed when CUBLAS was very slow on most non-GEMM operations; that is not really a problem anymore.
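For reference, here is a minimal sketch of the blocked pattern described above, assuming a left-side, lower-triangular, column-major solve expressed with plain CBLAS calls. The function name and the CPU-only structure are illustrative, not DPLASMA's actual implementation (where the per-panel TRSM historically ran on the CPU and the trailing GEMM update was offloaded to the GPU).

```c
#include <cblas.h>

/* Illustrative sketch: blocked left-lower TRSM, solving L * X = B in place.
 * L is n x n lower triangular (column-major, leading dimension ldl),
 * B is n x m (column-major, leading dimension ldb) and is overwritten by X. */
void blocked_trsm_lower(int n, int m, int nb,
                        const double *L, int ldl,
                        double *B, int ldb)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;

        /* Panel solve: L(k,k) * X(k,:) = B(k,:).  (The CPU TRSM in the
         * original DPLASMA scheme.)  With nb == n this single call is the
         * entire problem, so there is nothing left to parallelize. */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                    CblasNoTrans, CblasNonUnit,
                    kb, m, 1.0, &L[k + (size_t)k * ldl], ldl, &B[k], ldb);

        /* Trailing update: B(k+kb:n,:) -= L(k+kb:n,k) * X(k,:).  (The GEMM
         * that DPLASMA offloads to the GPU.)  These updates carry the
         * parallelism, and they vanish as nb approaches n. */
        int rest = n - k - kb;
        if (rest > 0)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        rest, m, kb, -1.0,
                        &L[(k + kb) + (size_t)k * ldl], ldl,
                        &B[k], ldb, 1.0, &B[k + kb], ldb);
    }
}
```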
Reopening because the behavior is 'as expected' but sub-optimal, and there is no real reason not to call CUBLAS in all cases anymore.
Original comment by Mikael Simberg (Bitbucket: [Mikael Simberg](https://bitbucket.org/Mikael Simberg)). Thanks for the explanation; the performance drop makes a lot more sense if it ends up running on the CPU!
Original report by Mikael Simberg (Bitbucket: [Mikael Simberg](https://bitbucket.org/Mikael Simberg)).
I’m comparing the performance of dplasma with other libraries on GPUs, and I’m particularly looking at trsm at the moment. I see performance initially increase with the block size until it reaches a good fraction of the peak flops of the GPU, but after that performance drops significantly. More concretely, I’m running dplasma on a single node with a P100 GPU and a 12-core Haswell CPU (Piz Daint GPU partition), built in release mode with GCC 8.3 and CUDA 10.2 (I pass no additional options to the CMake configuration except the build type), and get the following results:
Is this expected behaviour? What could explain the drop? Is there something in the configuration that could be causing this? I don’t actually expect to be running with huge block sizes (especially NB == N), but I wouldn’t have expected performance to start dropping already at NB = MB = 1024.