-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rdma: Eliminate unnecessary ctrl message waits in eager protocol #553
Conversation
This looks like it has real CI failures on P5 (all p5 builds failed):
|
I am able to reproduce this. It seems to only happen with user-registered buffers. I'm assuming it has to do with NCCL's loopback connection for user-reg. Investigating. |
I see a problem: our API still owns the endpoints (e.g., link), which doesn't work for the usage model proposed in this PR (since the endpoint needs to live until the deferred cleanup has completed). Since I anyway planned on moving endpoint ownership from the API to the protocol-specific communicators, I will do that as part of this PR. |
9db066d
to
7158a4f
Compare
7158a4f
to
70f39f5
Compare
Force-push to 70f39f5 addresses feedback. |
70f39f5
to
7ea8352
Compare
Force-push to 7ea8352 is a rebase on master. |
7ea8352
to
0a60be3
Compare
Force-push to 0a60be3 is a rebase on master. |
0a60be3
to
4ff8183
Compare
Changed to use |
4ff8183
to
9a35745
Compare
Removed |
Tests added for new functionality as well. Signed-off-by: Eric Raut <eraut@amazon.com>
For upcoming deferred comm close work Signed-off-by: Eric Raut <eraut@amazon.com>
Changes to the plugin will require changing the endpoint ownership from the API to the protocol-specific send/recv communicators. As a necessary step toward this, make the send/recv communicators responsible for freeing their associated endpoints. Signed-off-by: Eric Raut <eraut@amazon.com>
Fail send/recv if called on inactive communicator (after close) Signed-off-by: Eric Raut <eraut@amazon.com>
Communicators will be added to the cleanup list when NCCL calls `closeSend`/`closeRecv`. Closing the last communicator in the process will be blocking. Future commits will add a message to be sent on communicator close, which will be waited on here before destroying communicator resources. Signed-off-by: Eric Raut <eraut@amazon.com>
The close message will be sent on communicator close in an upcoming commit. Signed-off-by: Eric Raut <eraut@amazon.com>
The receive communicator, on closing, will send a message indicating it is safe for the send communicator to close resources. The sender will not destroy communicator resources until it receives this message. This will be required for the eager protocol when the sender does not wait for a control message before marking the send complete (upcoming commit). It prevents sender from closing the communicator before the receiver sends a control message, leading to an error on the receiver. Signed-off-by: Eric Raut <eraut@amazon.com>
Updates the send_close message to include a count of the number of control messages sent by receiver. Sender will wait to receive all control messages before closing resources. This handles the unlikely case where completions are reordered such that the close message arrives before a control message. Signed-off-by: Eric Raut <eraut@amazon.com>
Sender will not wait to receive a control message from receiver for eager messages, and receiver will not send a control message if it has already received eager data. Both communicators will await any outstanding operations on communicator close. Signed-off-by: Eric Raut <eraut@amazon.com>
9a35745
to
28acdf2
Compare
Force-push to 28acdf2 is a rebase on master. |
Eliminating the waits for eager messages causes sender and receiver to go out of sync. A patch intended to keep them in sync (2479540) also did not resolve the issue. A proper design fix is still in progress; disabling the waits until then. The supporting refactoring commits of [aws#553](aws#553) are retained so we can re-enable this feature later. Only the last commit is reverted. The sync patch is also reverted. This reverts commits 2479540 and 7a0fa83. Signed-off-by: Eric Raut <eraut@amazon.com>
Eliminating the waits for eager messages causes sender and receiver to go out of sync. A patch intended to keep them in sync (2479540) also did not resolve the issue. A proper design fix is still in progress; re-enabling the waits until then. The supporting refactoring commits of [aws#553](aws#553) are retained so we can re-enable this feature later. Only the last commit is reverted. The sync patch is also reverted. This reverts commits 2479540 and 7a0fa83. Signed-off-by: Eric Raut <eraut@amazon.com>
Eliminating the waits for eager messages causes sender and receiver to go out of sync. A patch intended to keep them in sync (2479540) also did not resolve the issue. A proper design fix is still in progress; re-enabling the waits until then. The supporting refactoring commits of aws#553 are retained so we can re-enable this feature later. Only the last commit is reverted. The sync patch is also reverted. This reverts commits 2479540 and 7a0fa83. Signed-off-by: Eric Raut <eraut@amazon.com>
Eliminating the waits for eager messages can cause the sender and receiver to go out-of-sync during a long run of eager messages. For example, at an extreme, the sender could send 1024 eager messages, which are queued by Libfabric and marked complete. Then if the sender received a control message for message 0 from the receiver, it couldn't be unambiguously associated with the first message or the 1024th message, which both have the same sequence number, since the sequence number is only 10 bits. A patch intended to keep them in sync (2479540) also did not resolve the issue. Without `FI_ORDER_SAS` semantics (which the plugin does not request), the completions can be reordered, and the receipt of of sync ctrl message does not guarantee that previous ctrl messages have completed, which can lead to the same problem. A proper design fix is still in progress; this PR re-enables the eager waits in the interim. The supporting refactoring commits of #553 are retained so we can re-enable this feature later. Only the last commit is reverted. The sync patch is also reverted. This reverts commits 2479540 and 7a0fa83. Signed-off-by: Eric Raut <eraut@amazon.com>
Sender will not wait to receive a control message from receiver for
eager messages, and receiver will not send a control message if it has
already received eager data.
Both communicators will await any outstanding operations on communicator
close.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.