rdma: fixed accept() error path
On error, accept() tries to close the communicator, and the result of that
close overwrote the return value (i.e., the original error code).
As a result, when accept() failed but the close succeeded, accept() returned
success instead of the error code, leading to infinite retries.
This also fixes a typo.
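
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the plugin's code: `fake_close()`, `close_result`, `accept_old()`, `accept_fixed()`, and `check_accept_paths()` are hypothetical stand-ins introduced only for illustration, with `fake_close()` playing the role of `close_listen_recv_comm()`.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for close_listen_recv_comm(): returns close_result. */
static int close_result = 0;

static int fake_close(void)
{
	return close_result;
}

/* Old behavior: the cleanup result overwrites the original error code. */
static int accept_old(int op_ret)
{
	int ret = op_ret;

	ret = fake_close();
	return ret;
}

/*
 * Fixed behavior: keep the first error; report a close failure only
 * when nothing failed earlier.
 */
static int accept_fixed(int op_ret)
{
	int ret = op_ret;
	int close_ret = fake_close();

	return ret ? ret : close_ret;
}

/* Walks through the combinations of operation result and close result. */
static void check_accept_paths(void)
{
	close_result = 0;
	assert(accept_old(-EINVAL) == 0);          /* error is lost */
	assert(accept_fixed(-EINVAL) == -EINVAL);  /* error is preserved */
	assert(accept_fixed(0) == 0);

	close_result = -EIO;
	assert(accept_fixed(0) == -EIO);           /* close failure surfaces */
	assert(accept_fixed(-EINVAL) == -EINVAL);  /* first error still wins */
}
```

Calling `check_accept_paths()` from a test harness exercises each case. Keeping the cleanup result in its own variable is what lets the first error take precedence over a successful close.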

Signed-off-by: Amedeo Sapio <asapio@amazon.com>
AmedeoSapio committed Mar 17, 2024
1 parent fc8f137 commit 1e1ffc9
Showing 1 changed file with 7 additions and 6 deletions.
13 changes: 7 additions & 6 deletions src/nccl_ofi_rdma.c
@@ -3895,12 +3895,13 @@ static int accept(nccl_net_ofi_listen_comm_t *listen_comm,
 		ret = -EINVAL;
 	}
 
-exit:
-
+exit:;
 	/* Close receive communicator in case listen operation failed */
-	ret = close_listen_recv_comm(l_comm);
-
-	return ret;
+	int close_ret = close_listen_recv_comm(l_comm);
+	if (close_ret) {
+		NCCL_OFI_WARN("Failed to close listen communicator");
+	}
+	return ret ? ret : close_ret;
 }
 
 static int listen_close(nccl_net_ofi_listen_comm_t *listen_comm)
@@ -6050,7 +6051,7 @@ int nccl_net_ofi_rdma_init(const char *provider_filter,
 
 	goto exit;
 
-error:;
+error:
 	if (base_devs) {
 		for (nccl_net_ofi_device_t **base_dev = base_devs; base_dev != base_devs + num_devs; ++base_dev) {
 			nccl_net_ofi_rdma_device_t *device =
