param: autogenerate topology file by default for RDMA protocol
Instead of using a static topology file, an autogenerated topology file
will be used when the user does not explicitly set a topology file
(`NCCL_TOPO_FILE`) and uses the RDMA protocol. For platform-aws builds,
this applies to the P5 platform.

Signed-off-by: Eric Raut <eraut@amazon.com>
rauteric committed Aug 12, 2024
1 parent 01ef439 commit 64ccd78
Showing 3 changed files with 3 additions and 19 deletions.
7 changes: 0 additions & 7 deletions include/nccl_ofi_param.h
@@ -162,13 +162,6 @@ OFI_NCCL_PARAM_STR(protocol, "PROTOCOL", NULL);

OFI_NCCL_PARAM_INT(domain_per_thread, "DOMAIN_PER_THREAD", -1);

-/*
- * When enabled and RDMA communication protocol is used, write NCCL
- * topology file and set environment variable `NCCL_TOPO_FILE`. OFI plugin
- * writes the NCCL topology file to a memfd file.
- */
-OFI_NCCL_PARAM_INT(topo_file_write_enable, "TOPO_FILE_WRITE_ENABLE", 0);
-
/*
* Disable the native RDMA write support check when using the "RDMA" protocol
* for send/recv operations on AWS platforms. When the check is disabled, the
13 changes: 2 additions & 11 deletions src/nccl_ofi_rdma.c
@@ -247,10 +247,8 @@ static inline struct fid_domain *get_domain_from_endpoint(nccl_net_ofi_rdma_ep_t
/*
* @brief Write topology to NCCL topology file
*
- * If environment variable `OFI_NCCL_TOPO_FILE_WRITE_ENABLE` is set,
- * this function writes a NCCL topology file to a memfd file.
- *
- * It also sets environment variable `NCCL_TOPO_FILE` to the
+ * This function writes a NCCL topology file to a memfd file, and
+ * sets environment variable `NCCL_TOPO_FILE` to the
* filename path of topology file.
*
* @param topo
@@ -264,13 +262,6 @@ static int write_topo_file(nccl_ofi_topo_t *topo)
int topo_fd = -1;
FILE *file = NULL;

-/* This function is a no-op in case writing topology file is not enabled explicitly */
-if (!ofi_nccl_topo_file_write_enable()) {
-	NCCL_OFI_TRACE(NCCL_INIT | NCCL_NET,
-	               "Topology write not enabled; skipping");
-	goto exit;
-}
-
/**
* If `NCCL_TOPO_FILE` is already set, don't set it again.
*
2 changes: 1 addition & 1 deletion src/platform-aws.c
@@ -67,7 +67,7 @@ struct ec2_platform_data {
},
{
.name = "p5.48xlarge",
-.topology = "p5.48xl-topo.xml",
+.topology = NULL,
.default_dup_conns = 0,
.latency = 75.0,
.gdr_required = true,
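
For readers unfamiliar with the memfd approach mentioned in the `write_topo_file` comment, the following is a minimal sketch of the general technique, not the plugin's actual implementation. The helper name `publish_topo_via_memfd`, the memfd name `ofi_nccl_topo`, and the use of a `/proc/self/fd/<fd>` path for `NCCL_TOPO_FILE` are assumptions made for illustration. The sketch keeps the behaviour the diff describes: if `NCCL_TOPO_FILE` is already set, it is left untouched.

```c
#define _GNU_SOURCE /* needed for memfd_create() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the plugin's write_topo_file()).
 *
 * Writes `xml` to an anonymous in-memory file and points NCCL_TOPO_FILE
 * at it. Returns 0 on success (including the case where the user has
 * already set NCCL_TOPO_FILE), -1 on failure.
 */
static int publish_topo_via_memfd(const char *xml)
{
	char path[64];
	size_t len = strlen(xml);
	int fd;

	/* Respect an explicit user setting: do not override it. */
	if (getenv("NCCL_TOPO_FILE") != NULL)
		return 0;

	/* Anonymous file backed by memory; the name is only a debugging label. */
	fd = memfd_create("ofi_nccl_topo", 0);
	if (fd < 0)
		return -1;

	if (write(fd, xml, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}

	/* Assumption: expose the memfd to readers through its /proc path. */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (setenv("NCCL_TOPO_FILE", path, 1) != 0) {
		close(fd);
		return -1;
	}

	/* The descriptor is deliberately left open: closing the last
	 * reference to a memfd discards its contents. */
	return 0;
}
```

A caller would pass the generated topology XML as a string, for example `publish_topo_via_memfd("<system version=\"1\"></system>\n")` with a placeholder document. Because the file lives only in memory and is tied to the open descriptor, nothing is left on disk when the process exits.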
