rdma: Skip eager protocol when writes outstanding
The LL128 protocol causes an interesting pattern where there's a
long string of large-ish messages and then a "leftover", most
noticeable on power-of-2 message sizes like those used in benchmarks.
If that leftover bit is eager sized, you end up with an eager
message queued behind a bunch of writes, which doesn't gain you
anything in hiding the latency of the receive request, but still
costs you the fixup on the remote side.

This patch implements an overly simplistic version of runting,
which skips eager if there is already a write outstanding on
the communicator.  This is wrong for a couple of reasons.  First,
we should probably use a BDP or similar estimate to determine how
much data should be in flight before we skip the eager protocol
(a rough sketch of that variant follows below).  Second, we should
really decrement the counter at CQE processing time, but that is
a locking mess.  And finally, it should really be a domain counter
instead of a communicator counter, but again, see the locking-mess
comment.  The NCCL communication patterns are generally simple
enough that none of these really seems to matter, so we go with
simplicity for now.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 43b7bcd)
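
The BDP-based variant suggested in the message is not part of this commit.  A rough, hypothetical sketch of what such a check could look like follows; the bytes_inflight and bdp_bytes parameters stand in for state the plugin does not currently track, while the have_ctrl, eager_max_size, and zero-size conditions mirror the existing check in send().

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical eager decision with a BDP-style threshold.  The
 * bytes_inflight and bdp_bytes parameters are illustrative stand-ins,
 * not existing fields of the send communicator. */
static inline bool use_eager_protocol(size_t size, bool have_ctrl,
                                      size_t eager_max_size,
                                      uint64_t bytes_inflight,
                                      uint64_t bdp_bytes)
{
        /* Zero-byte sends are always eager. */
        if (size == 0)
                return true;

        /* No eager once a control message has already arrived, or for
         * messages above the eager size limit (matches the existing check). */
        if (have_ctrl || size > eager_max_size)
                return false;

        /* Runting: stop using eager once enough data is already in
         * flight to cover the bandwidth-delay product. */
        return bytes_inflight < bdp_bytes;
}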
bwbarrett authored and arunkarthik-akkart committed Dec 5, 2024
1 parent 1779045 commit eeadb5a
Showing 2 changed files with 20 additions and 1 deletion.
2 changes: 2 additions & 0 deletions include/nccl_ofi_rdma.h
@@ -498,6 +498,8 @@ typedef struct nccl_net_ofi_rdma_send_comm {
 	nccl_net_ofi_send_comm_t base;
 
 	uint64_t num_inflight_reqs;
+	uint64_t num_inflight_writes;
+
 	nccl_ofi_freelist_t *nccl_ofi_reqs_fl;
 
 	/* Comm ID provided by the local endpoint */
19 changes: 18 additions & 1 deletion src/nccl_ofi_rdma.c
@@ -2074,6 +2074,18 @@ static inline int free_send_req(nccl_net_ofi_rdma_req_t *req,
 
 	send_data = get_send_data(req);
 
+	if (!send_data->eager) {
+		/* free is going to be called inside of test(), which will
+		   happen in a time when NCCL guarantees no other thread will
+		   be accessing the communicator. So no mutex protections are
+		   required if we do it here. Better would be to do this as
+		   soon as we get the CQE for this request, but that would
+		   require atomics or locks, which isn't worth it today. But
+		   if we ever refactor the locking strategy, we should revisit
+		   this. */
+		(s_comm->num_inflight_writes)--;
+	}
+
 	if (send_data->schedule) {
 		nccl_net_ofi_rdma_device_t *device = rdma_req_get_device(req);
 		nccl_net_ofi_release_schedule(device->scheduler, send_data->schedule);
@@ -5857,7 +5869,7 @@ static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, int size, int t
 
 	/* Determine if this should be sent eagerly. */
 	eager = false;
-	if ((!have_ctrl && (size_t)size <= eager_max_size) || (size == 0)) {
+	if ((!have_ctrl && (size_t)size <= eager_max_size && s_comm->num_inflight_writes == 0) || (size == 0)) {
 		eager = true;
 	}
 
@@ -5897,6 +5909,10 @@ static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, int size, int t
 	 */
 	(s_comm->num_inflight_reqs)++;
 
+	if (!eager) {
+		(s_comm->num_inflight_writes)++;
+	}
+
 	NCCL_OFI_TRACE_SEND(req->dev_id, size, s_comm, msg_seq_num, req, base_req);
 
 	/* Try posting RDMA write for received RDMA control messages */
@@ -5958,6 +5974,7 @@ static int send_close_deferred(nccl_net_ofi_send_comm_t *send_comm)
 		ret = -EINVAL;
 		goto exit;
 	}
+	assert (s_comm->num_inflight_writes == 0);
 
 	s_comm->comm_active = false;
 
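The comment added to free_send_req notes that decrementing the counter at CQE processing time would need atomics or locks.  Purely as an illustration, and not part of this commit, a C11 atomic counter would make that safe; the struct and helpers below are hypothetical stand-ins, and adopting this would mean changing the type of num_inflight_writes and every site that touches it.

#include <stdatomic.h>
#include <stdint.h>

/* Sketch only: if num_inflight_writes were a C11 atomic, the
 * CQE-processing path could decrement it directly instead of waiting
 * for free_send_req().  This struct is an illustrative stand-in, not
 * the plugin's real send communicator. */
struct send_comm_sketch {
        atomic_uint_fast64_t num_inflight_writes;
};

/* Called when a write is posted on the communicator. */
static void on_write_posted(struct send_comm_sketch *s_comm)
{
        atomic_fetch_add_explicit(&s_comm->num_inflight_writes, 1,
                                  memory_order_relaxed);
}

/* Called from the progress path when the write's CQE is reaped; safe
 * even if another thread is posting sends on the same communicator. */
static void on_write_cqe(struct send_comm_sketch *s_comm)
{
        atomic_fetch_sub_explicit(&s_comm->num_inflight_writes, 1,
                                  memory_order_relaxed);
}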
