Switch to CUB for Energy Accumulation #1474

Merged
merged 17 commits into master from task/switch-to-cub-for-accumulation-of-energy on Feb 7, 2025

Conversation

@badisa badisa commented Feb 4, 2025

  • Avoid managing our own implementation of parallel sum; instead, rely on CUB (see the sketch after this list). I believe at the time of implementing our own version, I had convinced myself that CUB didn't support int128, which is not true.
  • Does not resolve the issue seen while running timemachine under MPS.
  • Appears to be slightly slower than the old implementation, but within 1%. The reduced code maintenance seems worth it.
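
For readers unfamiliar with the CUB API, the sketch below shows the generic two-call pattern used by cub::DeviceReduce::Sum: the first call with a null temp-storage pointer only queries the required scratch size, and the second performs the reduction. This is illustrative only, not the timemachine code; the function and buffer names are hypothetical, and in this PR the temporary storage is allocated once per potential (see the constructor snippet further down) rather than per call.

#include <cub/cub.cuh>

// Illustrative only: generic CUB two-call reduction over an array of
// __int128 fixed-point energies. All names here are hypothetical.
void sum_energies(const __int128 *d_in, __int128 *d_out, int num_items, cudaStream_t stream) {
    // First call with d_temp_storage == nullptr only computes the
    // required temporary storage size.
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(nullptr, temp_storage_bytes, d_in, d_out, num_items, stream);

    void *d_temp_storage = nullptr;
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call performs the actual device-wide sum into d_out.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaFree(d_temp_storage);
}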

@badisa badisa added the deboggle label Feb 4, 2025
@badisa badisa changed the title from "Task/switch to cub for accumulation of energy" to "Switch to CUB for Energy Accumulation" Feb 5, 2025
@badisa badisa requested review from mcwitt and proteneer February 5, 2025 15:46
@badisa badisa marked this pull request as ready for review February 5, 2025 15:46

mcwitt commented Feb 5, 2025

Sanity check: have we verified that we get bitwise-identical results compared with master?

badisa commented Feb 5, 2025

> Sanity check: have we verified that we get bitwise-identical results compared with master?

Yes, the nightly determinism tests pass

gpuErrchk(cudaMalloc(&d_sum_temp_storage_, sum_storage_bytes_));
};

FanoutSummedPotential::~FanoutSummedPotential() { gpuErrchk(cudaFree(d_sum_temp_storage_)); };
Collaborator

can DeviceBuffer be used for cases like this where the lifetime of the buffer is the same as the lifetime of the object?

Collaborator Author

Yes, but I don't think that pattern is worth it per #1474 (comment)

@mcwitt mcwitt left a comment

Nice! Looks like a pretty straightforward improvement.

One comment: it would be nice to use the RAII pattern (i.e. DeviceBuffer) for d_sum_temp_storage_ (and other cases where buffer and object lifetime are always identical). Though maybe the reason not to was for consistency, since we have many other device buffers that are still manually allocated and freed?

badisa commented Feb 5, 2025

> Nice! Looks like a pretty straightforward improvement.
>
> One comment: it would be nice to use the RAII pattern (i.e. DeviceBuffer) for d_sum_temp_storage_ (and other cases where buffer and object lifetime are always identical). Though maybe the reason not to was for consistency, since we have many other device buffers that are still manually allocated and freed?

The RAII pattern is used here, just on a per-potential basis rather than per DeviceBuffer. When DeviceBuffer owns a potential's GPU memory, I have to track down where failures are coming from manually, since failures are reported in device_buffer.cu rather than at the calling location. DeviceBuffer works well for short-lived memory, but until we can report the top-level location of the failure, it makes reporting and tracking down failures more difficult.

mcwitt commented Feb 5, 2025

> require me to track down where failures are coming from manually

Ah, right, I'd forgotten about that (severe) limitation in the current implementation of DeviceBuffer.

For some future PR, I wonder if there might be a way to overcome this, e.g. if there were a way to attach a backtrace to a custom exception class (perhaps using some more modern C++ features?).

In any case, agreed that it doesn't seem worth it for now.
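
As background on the pattern being discussed, here is a generic sketch of an RAII-style device buffer. It is not the timemachine DeviceBuffer implementation; the class name and details are hypothetical, and the comment marks the error-reporting limitation raised above.

#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical sketch of an RAII device buffer: allocation in the
// constructor, cudaFree in the destructor, so the buffer's lifetime is
// tied to the owning object's lifetime.
template <typename T> class ScopedDeviceBuffer {
public:
    explicit ScopedDeviceBuffer(size_t count) : size_(count * sizeof(T)), data_(nullptr) {
        cudaError_t err = cudaMalloc(reinterpret_cast<void **>(&data_), size_);
        if (err != cudaSuccess) {
            // The limitation raised above: the failure is reported from this
            // file, not from the calling location that triggered it.
            throw std::runtime_error(cudaGetErrorString(err));
        }
    }
    ~ScopedDeviceBuffer() { cudaFree(data_); }

    // Non-copyable: the wrapper uniquely owns the device allocation.
    ScopedDeviceBuffer(const ScopedDeviceBuffer &) = delete;
    ScopedDeviceBuffer &operator=(const ScopedDeviceBuffer &) = delete;

    T *data() const { return data_; }
    size_t size_bytes() const { return size_; }

private:
    size_t size_;
    T *data_;
};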

@@ -22,6 +22,9 @@ template <typename RealType> class CentroidRestraint : public Potential {
double kb_;
double b0_;

size_t sum_storage_bytes_;
Owner

do you still need the internal __int128 *d_u_buffer_;?

Collaborator Author

Whoops, we do not.

@@ -65,12 +61,9 @@ void CentroidRestraint<RealType>::execute_device(
         kb_,
         b0_,
         d_du_dx,
-        d_u == nullptr ? nullptr : d_u_buffer_);
+        d_u // Can write directly to the energy buffer for this potential.
@proteneer proteneer Feb 6, 2025

edit: thinking about this more clearly this morning - I guess there's currently no way to have the semantics of "increment d_u" safely given the __int128 type, unless we support atomicAdds, so the current behavior of setting d_u is probably correct.

old message:

I think this is pretty dangerous - the input API assumes (I think?) that __int128 *d_u can be set to some initial value x, and execute_device can increment it by y, resulting in x+y (like the forces). However, the actual internal kernel call directly sets d_u via:

//  k_centroid_restraint.cuh
if (t_idx == 0 && d_u) {
    RealType nrg = kb * (dij - b0) * (dij - b0);
    d_u[t_idx] = FLOAT_TO_FIXED_ENERGY<RealType>(nrg);
}

i.e. it blows away the old value of x - we should talk about the semantics of execute_device more carefully tomorrow (i.e. whether the behavior is to increment or to set)


-    block_energy_reduce<THREADS_PER_BLOCK>(block_energy_buffer, threadIdx.x);
+    // Sum's return value is only valid in thread 0
Owner

interesting switch of the location of the __syncthreads() - from before the reduce to after the reduce - intentional?

@proteneer proteneer Feb 6, 2025

i.e. guessing that BlockReduce::Sum() implicitly calls a syncthreads before but not after?

Collaborator Author

Yes, per the docs, if you want to call a reduce multiple times, you need to call __syncthreads() after each call.
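
To make the ordering concrete, here is a minimal illustrative kernel (not the timemachine kernel; all names are made up) showing the pattern in question: when the same BlockReduce temp storage is reused for a second reduction, a __syncthreads() is required between the two calls, and the reduced value is only valid in thread 0.

#include <cub/cub.cuh>

// Hypothetical kernel, launched with THREADS_PER_BLOCK threads per block,
// illustrating temp-storage reuse across two block reductions.
template <int THREADS_PER_BLOCK>
__global__ void k_two_reductions(const int *d_a, const int *d_b, int *d_out) {
    typedef cub::BlockReduce<int, THREADS_PER_BLOCK> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // First reduction; the return value is only valid in thread 0.
    int sum_a = BlockReduce(temp_storage).Sum(d_a[threadIdx.x]);

    // Required before temp_storage is reused for a second reduction.
    __syncthreads();

    int sum_b = BlockReduce(temp_storage).Sum(d_b[threadIdx.x]);

    if (threadIdx.x == 0) {
        d_out[0] = sum_a;
        d_out[1] = sum_b;
    }
}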

@proteneer proteneer left a comment

nice changes!

@badisa badisa added the cr_cppcuda C++ and CUDA label Feb 6, 2025
@badisa badisa merged commit b1a3108 into master Feb 7, 2025
1 check passed
@badisa badisa deleted the task/switch-to-cub-for-accumulation-of-energy branch February 7, 2025 18:29