Switch to CUB for Energy Accumulation #1474
Conversation
badisa commented Feb 4, 2025 (edited)
- Avoid managing our own implementation of parallel sum and instead rely on CUB. At the time of implementing our own version, I had convinced myself that CUB didn't support int128, which is not true.
- Does not resolve the issue seen while running timemachine under MPS.
- Appears to be slightly slower than the old implementation, but within 1%. The reduced code maintenance seems worth it.
  * Still missing in the barostat
  * Also correctly use it in the pairlist
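The CUB-based accumulation described above can be sketched as follows. This is a hypothetical, self-contained example (`d_in`, `d_out`, and the missing error checking are placeholders, not timemachine's actual code); it shows the standard two-phase CUB pattern of querying the temp-storage size with a null pointer, then running the reduction over `__int128` fixed-point energies:

```cuda
#include <cub/cub.cuh>

int main() {
    const int N = 1024;
    __int128 *d_in, *d_out; // fixed-point energy accumulators
    cudaMalloc(&d_in, N * sizeof(__int128));
    cudaMalloc(&d_out, sizeof(__int128));

    // Phase 1: with a null temp-storage pointer, CUB only writes the
    // required temp-storage size into temp_bytes.
    void *d_temp_storage = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_bytes, d_in, d_out, N);

    // Phase 2: allocate the temp storage once (e.g. in a potential's
    // constructor) and perform the actual reduction.
    cudaMalloc(&d_temp_storage, temp_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_bytes, d_in, d_out, N);

    cudaFree(d_temp_storage);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

`cub::DeviceReduce::Sum` works here because `__int128` supports the default `+` operator; the temp-storage query only needs to be rerun if the input size grows.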
Sanity check: have we verified that we get bitwise-identical results compared with master?

Yes, the nightly determinism tests pass.
```cpp
    gpuErrchk(cudaMalloc(&d_sum_temp_storage_, sum_storage_bytes_));
};

FanoutSummedPotential::~FanoutSummedPotential() { gpuErrchk(cudaFree(d_sum_temp_storage_)); };
```
Can `DeviceBuffer` be used for cases like this, where the lifetime of the buffer is the same as the lifetime of the object?
Yes, but I don't think that pattern is worth it per #1474 (comment)
Nice! Looks like a pretty straightforward improvement.

One comment: it would be nice to use the RAII pattern (i.e. `DeviceBuffer`) for `d_sum_temp_storage_` (and other cases where buffer and object lifetime are always identical). Though maybe the reason not to was for consistency, since we have many other device buffers that are still manually allocated and freed?
The RAII pattern is used here, just done on a per-potential basis rather than the …
Ah, right, I'd forgotten about that (severe) limitation in the current implementation. For some future PR, I wonder if there might be a way to overcome this, e.g. if there were a way to attach a backtrace to a custom exception class (perhaps using some more modern C++ features?). In any case, agreed that it doesn't seem worth it for now.
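For reference, the RAII pattern under discussion can be sketched roughly like this. This is a hypothetical minimal `DeviceBuffer`, not timemachine's actual implementation (which, among other things, error-checks the allocation):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Minimal RAII wrapper: the device allocation's lifetime is tied to
// the owning object's lifetime, so no explicit cudaFree is needed
// at each call site.
template <typename T> class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t count) {
        cudaMalloc(&data_, count * sizeof(T)); // real code should check the return value
    }
    ~DeviceBuffer() { cudaFree(data_); }

    // Single-owner semantics: copying would lead to a double free.
    DeviceBuffer(const DeviceBuffer &) = delete;
    DeviceBuffer &operator=(const DeviceBuffer &) = delete;

    T *data() const { return data_; }

private:
    T *data_ = nullptr;
};
```

A member of this type in a potential class would replace the matched `cudaMalloc`/`cudaFree` pair in the constructor and destructor.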
```cpp
@@ -22,6 +22,9 @@ template <typename RealType> class CentroidRestraint : public Potential {
    double kb_;
    double b0_;

    size_t sum_storage_bytes_;
```
Do you still need the internal `__int128 *d_u_buffer_`?
Whoops, we do not.
```cpp
@@ -65,12 +61,9 @@ void CentroidRestraint<RealType>::execute_device(
    kb_,
    b0_,
    d_du_dx,
-   d_u == nullptr ? nullptr : d_u_buffer_);
+   d_u); // Can write directly to the energy buffer for this potential.
```
edit: thinking about this more clearly this morning - I guess there's currently no way to have the semantics of "increment `d_u`" safely given the `__int128` type, unless we support atomicAdds, so the current behavior of setting `d_u` is probably correct.

old message:

I think this is pretty dangerous - the input API assumes (I think?) that `__int128 *d_u` can be set to some initial value `x`, and `execute_device` can increment it by `y`, resulting in `x + y` (like the forces). However, the actual internal kernel call directly sets `d_u` via:

```cuda
// k_centroid_restraint.cuh
if (t_idx == 0 && d_u) {
    RealType nrg = kb * (dij - b0) * (dij - b0);
    d_u[t_idx] = FLOAT_TO_FIXED_ENERGY<RealType>(nrg);
}
```

i.e. it blows away the old value of `x` - we should talk about the semantics of `execute_device` more carefully tomorrow (i.e. whether the behavior is to increment or to set).
```cpp
block_energy_reduce<THREADS_PER_BLOCK>(block_energy_buffer, threadIdx.x);
// Sum's return value is only valid in thread 0
```
Interesting switch of the location of the `__syncthreads()` - from before the reduce to after the reduce - intentional?
i.e. guessing that `BlockReduce::Sum()` implicitly calls a `__syncthreads()` before but not after?
Yes, per the docs: if you want to call a reduce multiple times, you need to call `__syncthreads()` after each call.
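Concretely, the pattern the docs require looks roughly like this (a sketch with assumed kernel and buffer names, not the PR's actual kernel):

```cuda
#include <cub/cub.cuh>

template <int THREADS_PER_BLOCK>
__global__ void k_two_reductions(const int *in_a, const int *in_b, int *out) {
    typedef cub::BlockReduce<int, THREADS_PER_BLOCK> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int a = BlockReduce(temp_storage).Sum(in_a[threadIdx.x]);
    __syncthreads(); // required: temp_storage is reused by the next call

    int b = BlockReduce(temp_storage).Sum(in_b[threadIdx.x]);

    if (threadIdx.x == 0) { // Sum's return value is only valid in thread 0
        out[0] = a;
        out[1] = b;
    }
}
```

The sync goes after each reduce (before the next reuse of the shared temp storage); a single reduction per kernel launch needs no trailing `__syncthreads()`.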
nice changes!