Reapply "defaults: make dmabuf opt-in"
This reverts commit 224593f.

Our shared development cluster seems to have issues with dmabuf when
running NCCL tests in a handful of niche situations, i.e. two nodes,
with MPI_Comm_split equal to the number of GPUs, at 16GB+. Other
environments seem not to have issues with the same workload, but out of
an abundance of caution and due to a lack of root cause, this is being
reverted again.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
(cherry picked from commit 1a46a67)
Nicholas Sielicki authored and arunkarthik-akkart committed Dec 5, 2024
1 parent e9f44eb commit 8b67422
Showing 1 changed file with 5 additions and 4 deletions.
9 changes: 5 additions & 4 deletions include/nccl_ofi_param.h
@@ -272,14 +272,15 @@ OFI_NCCL_PARAM_INT(disable_gdr_required_check, "DISABLE_GDR_REQUIRED_CHECK", 0);
  * Unfortunately, the plugin needs to signal DMABUF support or lack thereof back
  * to NCCL prior to having an opportuntiy to make any any memory registrations.
  * This ultimately means that the plugin will opimistically assume DMA-BUF is
- * viable on all FI_HMEM providers beyond libfabric 1.20.
+ * viable on all FI_HMEM providers beyond libfabric 1.20, if not for this param.
  *
  * If dmabuf registrations fail, (ie: if ibv_reg_dmabuf_mr cannot be resolved),
  * the plugin has no freedom to renegotiate DMABUF support with NCCL, and so it
- * is fatal. Under those conditions, users should set this environment variable
- * to force NCCL to avoid providing dmabuf file desciptors.
+ * is fatal. Under those conditions, users should ensure that they have set this
+ * environment variable to '1' to force NCCL to avoid providing dmabuf file
+ * desciptors. This is the default, pending perf investigations.
  */
-OFI_NCCL_PARAM_INT(disable_dmabuf, "DISABLE_DMABUF", 0);
+OFI_NCCL_PARAM_INT(disable_dmabuf, "DISABLE_DMABUF", 1);

 /*
  * Messages sized larger than this threshold will be striped across multiple rails
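
The behavioural effect of the hunk above is that dmabuf is now disabled unless the user explicitly opts in. The following is a minimal sketch of that "integer parameter with a default" pattern, not the plugin's actual OFI_NCCL_PARAM_INT implementation. The helper names `read_int_param` and `dmabuf_disabled` are hypothetical, and the variable name `OFI_NCCL_DISABLE_DMABUF` assumes the plugin prepends an `OFI_NCCL_` prefix to the string given in the macro, which the diff itself does not show.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the plugin): return the integer value of
 * the environment variable env_name, or default_value when the variable is
 * unset, empty, or not a valid base-10 integer.
 */
static long read_int_param(const char *env_name, long default_value)
{
	const char *raw = getenv(env_name);
	char *end = NULL;
	long value;

	if (raw == NULL || *raw == '\0')
		return default_value;

	errno = 0;
	value = strtol(raw, &end, 10);
	if (errno != 0 || end == raw || *end != '\0')
		return default_value;

	return value;
}

/*
 * Assumed variable name: "OFI_NCCL_" prefix plus "DISABLE_DMABUF".
 * The default of 1 mirrors the value introduced by this commit, so dmabuf
 * stays off unless the variable is explicitly set to 0.
 */
static int dmabuf_disabled(void)
{
	return read_int_param("OFI_NCCL_DISABLE_DMABUF", 1) != 0;
}
```

Under these assumptions, a user who wants dmabuf registrations would export the variable as 0 before launching the job, and leaving it unset keeps the opt-in default described in the commit message.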