Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdma: Eliminate unnecessary ctrl message waits in eager protocol #553

Merged
merged 9 commits into from
Sep 11, 2024

Conversation

rauteric
Copy link
Contributor

@rauteric rauteric commented Sep 3, 2024

Sender will not wait to receive a control message from receiver for
eager messages, and receiver will not send a control message if it has
already received eager data.

Both communicators will await any outstanding operations on communicator
close.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rauteric rauteric requested a review from a team as a code owner September 3, 2024 23:59
@a-szegel
Copy link
Contributor

a-szegel commented Sep 4, 2024

This looks like it has real CI failures on P5 (all p5 builds failed):

scatter_perf: nccl_ofi_rdma.c:254: get_rail: Assertion `ep->rails' failed.
[ip-172-31-7-104:29380] *** Process received signal ***
[ip-172-31-7-104:29380] Signal: Aborted (6)
[ip-172-31-7-104:29380] Signal code:  (-6)

@rauteric
Copy link
Contributor Author

rauteric commented Sep 4, 2024

This looks like it has real CI failures on P5 (all p5 builds failed):

scatter_perf: nccl_ofi_rdma.c:254: get_rail: Assertion `ep->rails' failed.
[ip-172-31-7-104:29380] *** Process received signal ***
[ip-172-31-7-104:29380] Signal: Aborted (6)
[ip-172-31-7-104:29380] Signal code:  (-6)

I am able to reproduce this. It seems to only happen with user-registered buffers. I'm assuming it has to do with NCCL's loopback connection for user-reg. Investigating.

@rauteric
Copy link
Contributor Author

rauteric commented Sep 4, 2024

This looks like it has real CI failures on P5 (all p5 builds failed):

scatter_perf: nccl_ofi_rdma.c:254: get_rail: Assertion `ep->rails' failed.
[ip-172-31-7-104:29380] *** Process received signal ***
[ip-172-31-7-104:29380] Signal: Aborted (6)
[ip-172-31-7-104:29380] Signal code:  (-6)

I am able to reproduce this. It seems to only happen with user-registered buffers. I'm assuming it has to do with NCCL's loopback connection for user-reg. Investigating.

I see a problem: our API still owns the endpoints (e.g., link), which doesn't work for the usage model proposed in this PR (since the endpoint needs to live until the deferred cleanup has completed).

Since I anyway planned on moving endpoint ownership from the API to the protocol-specific communicators, I will do that as part of this PR.

@rauteric rauteric force-pushed the ctrl_nowait_refactor branch from 9db066d to 7158a4f Compare September 4, 2024 22:57
@rauteric
Copy link
Contributor Author

rauteric commented Sep 4, 2024

Force-push to 7158a4f should fix the CI failure (link)

@rauteric rauteric force-pushed the ctrl_nowait_refactor branch from 7158a4f to 70f39f5 Compare September 5, 2024 23:27
@rauteric
Copy link
Contributor Author

rauteric commented Sep 5, 2024

Force-push to 70f39f5 addresses feedback.

@rauteric rauteric force-pushed the ctrl_nowait_refactor branch from 70f39f5 to 7ea8352 Compare September 5, 2024 23:32
@rauteric
Copy link
Contributor Author

rauteric commented Sep 5, 2024

Force-push to 7ea8352 is a rebase on master.

@rauteric rauteric dismissed bwbarrett’s stale review September 5, 2024 23:35

Addressed feedback

@rauteric rauteric force-pushed the ctrl_nowait_refactor branch from 7ea8352 to 0a60be3 Compare September 9, 2024 19:06
@rauteric
Copy link
Contributor Author

rauteric commented Sep 9, 2024

Force-push to 0a60be3 is a rebase on master.

@rauteric rauteric force-pushed the ctrl_nowait_refactor branch from 0a60be3 to 4ff8183 Compare September 9, 2024 23:13
@rauteric
Copy link
Contributor Author

rauteric commented Sep 9, 2024

Changed to use *_comm_destroy methods in case of error (before connection is established), and renamed *_close name to *_close_deferred for clarity.

@rauteric rauteric force-pushed the ctrl_nowait_refactor branch from 4ff8183 to 9a35745 Compare September 10, 2024 19:17
@rauteric
Copy link
Contributor Author

Removed set_send_ctrl_completed and folded behavior into process_completions.

Tests added for new functionality as well.

Signed-off-by: Eric Raut <eraut@amazon.com>
For upcoming deferred comm close work

Signed-off-by: Eric Raut <eraut@amazon.com>
Changes to the plugin will require changing the endpoint ownership from
the API to the protocol-specific send/recv communicators. As a necessary
step toward this, make the send/recv communicators responsible for
freeing their associated endpoints.

Signed-off-by: Eric Raut <eraut@amazon.com>
Fail send/recv if called on inactive communicator (after close)

Signed-off-by: Eric Raut <eraut@amazon.com>
Communicators will be added to the cleanup list when NCCL calls
`closeSend`/`closeRecv`. Closing the last communicator in the process
will be blocking.

Future commits will add a message to be sent on communicator close,
which will be waited on here before destroying communicator resources.

Signed-off-by: Eric Raut <eraut@amazon.com>
The close message will be sent on communicator close in an upcoming
commit.

Signed-off-by: Eric Raut <eraut@amazon.com>
The receive communicator, on closing, will send a message indicating it
is safe for the send communicator to close resources. The sender will
not destroy communicator resources until it receives this message.

This will be required for the eager protocol when the sender does not
wait for a control message before marking the send complete (upcoming
commit). It prevents sender from closing the communicator before the
receiver sends a control message, leading to an error on the receiver.

Signed-off-by: Eric Raut <eraut@amazon.com>
Updates the send_close message to include a count of the number of
control messages sent by receiver. Sender will wait to receive all
control messages before closing resources.

This handles the unlikely case where completions are reordered such that
the close message arrives before a control message.

Signed-off-by: Eric Raut <eraut@amazon.com>
Sender will not wait to receive a control message from receiver for
eager messages, and receiver will not send a control message if it has
already received eager data.

Both communicators will await any outstanding operations on communicator
close.

Signed-off-by: Eric Raut <eraut@amazon.com>
@rauteric rauteric force-pushed the ctrl_nowait_refactor branch from 9a35745 to 28acdf2 Compare September 10, 2024 19:17
@rauteric
Copy link
Contributor Author

Force-push to 28acdf2 is a rebase on master.

@rauteric rauteric merged commit 7a0fa83 into aws:master Sep 11, 2024
33 checks passed
@rauteric rauteric deleted the ctrl_nowait_refactor branch September 11, 2024 17:54
rauteric added a commit to rauteric/aws-ofi-nccl that referenced this pull request Oct 16, 2024
Eliminating the waits for eager messages causes sender and receiver to
go out of sync. A patch intended to keep them in sync (2479540) also did
not resolve the issue. A proper design fix is still in progress;
disabling the waits until then.

The supporting refactoring commits of
[aws#553](aws#553) are retained so we
can re-enable this feature later. Only the last commit is reverted. The
sync patch is also reverted.

This reverts commits 2479540 and
7a0fa83.

Signed-off-by: Eric Raut <eraut@amazon.com>
rauteric added a commit to rauteric/aws-ofi-nccl that referenced this pull request Oct 16, 2024
Eliminating the waits for eager messages causes sender and receiver to
go out of sync. A patch intended to keep them in sync (2479540) also did
not resolve the issue. A proper design fix is still in progress;
re-enabling the waits until then.

The supporting refactoring commits of
[aws#553](aws#553) are retained so we
can re-enable this feature later. Only the last commit is reverted. The
sync patch is also reverted.

This reverts commits 2479540 and
7a0fa83.

Signed-off-by: Eric Raut <eraut@amazon.com>
rauteric added a commit to rauteric/aws-ofi-nccl that referenced this pull request Oct 16, 2024
Eliminating the waits for eager messages causes sender and receiver to
go out of sync. A patch intended to keep them in sync (2479540) also did
not resolve the issue. A proper design fix is still in progress;
re-enabling the waits until then.

The supporting refactoring commits of
aws#553 are retained so we
can re-enable this feature later. Only the last commit is reverted. The
sync patch is also reverted.

This reverts commits 2479540 and
7a0fa83.

Signed-off-by: Eric Raut <eraut@amazon.com>
bwbarrett pushed a commit that referenced this pull request Oct 17, 2024
Eliminating the waits for eager messages can cause the sender and
receiver to go out-of-sync during a long run of eager messages. For
example, at an extreme, the sender could send 1024 eager messages, which
are queued by Libfabric and marked complete. Then if the sender received
a control message for message 0 from the receiver, it couldn't be
unambiguously associated with the first message or the 1024th message,
which both have the same sequence number, since the sequence number is
only 10 bits.

A patch intended to keep them in sync (2479540) also did not resolve the
issue. Without `FI_ORDER_SAS` semantics (which the plugin does not
request), the completions can be reordered, and the receipt of of sync
ctrl message does not guarantee that previous ctrl messages have
completed, which can lead to the same problem.

A proper design fix is still in progress; this PR re-enables the eager
waits in the interim.

The supporting refactoring commits of
#553 are retained so we
can re-enable this feature later. Only the last commit is reverted. The
sync patch is also reverted.

This reverts commits 2479540 and
7a0fa83.

Signed-off-by: Eric Raut <eraut@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants