
Separate endpoint for control messages #543

Merged

rajachan merged 3 commits into aws:master from control-qp-rebase on Sep 8, 2024

Conversation

rajachan
Member

Rebased and reworked #321 on master.

The commits are largely the same, with adaptations to recent changes around MR handling (including the addition of the cache) and a fix for a bug that caused fi_av_insert() on the control QP's address to fail (the conn message did not carry the correct address).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rajachan rajachan requested a review from a team as a code owner August 26, 2024 03:51
@rajachan rajachan force-pushed the control-qp-rebase branch 2 times, most recently from 6898e99 to ca7884f Compare August 26, 2024 04:06
@a-szegel
Contributor

bot:aws:retest

@bwbarrett
Contributor

The CI failure looks real; it looks like tracing broke.

@rauteric
Contributor

The previous PR (#321 (comment)) had some unresolved perf issues. Have we resolved those at this point?

@rajachan
Member Author

> The previous PR (#321 (comment)) had some unresolved perf issues. Have we resolved those at this point?

I tested up to 16 nodes and did not see any perf issues with the series. Was there any specific case where we saw issues earlier, @bwbarrett?

@bwbarrett
Contributor

> The previous PR (#321 (comment)) had some unresolved perf issues. Have we resolved those at this point?
>
> I tested up to 16 nodes and did not see any perf issues with the series. Was there any specific case where we saw issues earlier, @bwbarrett?

Large message Allreduce, but we've cleaned up a bunch of Libfabric and EFA issues since then, so if we don't see regressions now, we're fine.

@rajachan rajachan force-pushed the control-qp-rebase branch 2 times, most recently from dcef637 to 0f0343c Compare August 28, 2024 17:11
@rajachan
Member Author

Comments addressed and tracing breakage fixed. The previous round of AWS-CI runs had a functional test timeout; I'll watch this run and test locally to make sure that wasn't CI infra being CI infra.

@rauteric
Contributor

rauteric left a comment

Also, please squash out the merge commit before merging. 🙂

@a-szegel
Contributor

bot:aws:retest

@a-szegel
Contributor

a-szegel commented Sep 3, 2024

Rebased via GUI to re-kick off PR CI.

@a-szegel
Contributor

a-szegel commented Sep 4, 2024

bot:aws:retest

rauteric previously approved these changes Sep 6, 2024
@rajachan
Member Author

rajachan commented Sep 7, 2024

CI cleared, this is ready to go.

The control messages are not going to be scheduled over multiple rails,
so remove the unnecessary logic around it.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Signed-off-by: Raghu Raja <raghunch@amazon.com>
For the long message RDMA protocol, we want to make sure that we
never starve the sender for data to move, which means prioritizing
control messages from the receiver to the sender.  This patch moves
both the communicator setup and recv control messages to a new
endpoint, which is always on device rail 0.  Future patches will
optimize polling of the control message cq in the send path
and setting priority bits on the control cq.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Signed-off-by: Raghu Raja <raghunch@amazon.com>
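For readers unfamiliar with the libfabric calls involved, here is a minimal sketch of what opening a dedicated control endpoint with its own CQ on rail 0 could look like. This is not the plugin's actual code; `rail0_domain`, `rail0_info`, and `rail0_av` are hypothetical handles for rail 0's libfabric domain, info, and address vector.

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Hypothetical sketch: open a control-only endpoint and CQ on rail 0. */
static int open_ctrl_endpoint(struct fid_domain *rail0_domain,
                              struct fi_info *rail0_info,
                              struct fid_av *rail0_av,
                              struct fid_ep **ctrl_ep_out,
                              struct fid_cq **ctrl_cq_out)
{
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_DATA };
    struct fid_cq *ctrl_cq;
    struct fid_ep *ctrl_ep;
    int ret;

    /* Control messages get a completion queue of their own... */
    ret = fi_cq_open(rail0_domain, &cq_attr, &ctrl_cq, NULL);
    if (ret != 0)
        return ret;

    /* ...and a dedicated endpoint, always on device rail 0. */
    ret = fi_endpoint(rail0_domain, rail0_info, &ctrl_ep, NULL);
    if (ret != 0)
        return ret;

    ret = fi_ep_bind(ctrl_ep, &ctrl_cq->fid, FI_TRANSMIT | FI_RECV);
    if (ret != 0)
        return ret;
    ret = fi_ep_bind(ctrl_ep, &rail0_av->fid, 0);
    if (ret != 0)
        return ret;

    ret = fi_enable(ctrl_ep);
    if (ret != 0)
        return ret;

    *ctrl_ep_out = ctrl_ep;
    *ctrl_cq_out = ctrl_cq;
    return 0;
}
```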
If a match isn't found for the current send, poll the control cq
to see if the match can be found.  While this extends the current
send() call, it potentially lowers the time until data transfer
starts.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Signed-off-by: Raghu Raja <raghunch@amazon.com>
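As a rough illustration of that send-path optimization (not the plugin's actual code), the logic could look like the following; `find_match()` and `handle_ctrl_completion()` are hypothetical stand-ins for the plugin's matching and control-completion handling.

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Hypothetical helpers standing in for the plugin's internals. */
extern int find_match(void *send_req);
extern void handle_ctrl_completion(struct fi_cq_data_entry *entry);

/* If the current send has no matching control message yet, poll the
 * control CQ once; a match found here lets data transfer start sooner. */
static int try_match_with_ctrl_poll(struct fid_cq *ctrl_cq, void *send_req)
{
    struct fi_cq_data_entry entries[8];
    ssize_t n;

    if (find_match(send_req))
        return 1; /* already matched, no extra polling needed */

    n = fi_cq_read(ctrl_cq, entries, 8);
    if (n > 0) {
        for (ssize_t i = 0; i < n; i++)
            handle_ctrl_completion(&entries[i]);
        return find_match(send_req);
    }

    return 0; /* -FI_EAGAIN or error: still no match; retry later */
}
```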
@rajachan rajachan merged commit 7fc2122 into aws:master Sep 8, 2024
33 checks passed
@rajachan rajachan deleted the control-qp-rebase branch September 8, 2024 05:17
AmedeoSapio added a commit to AmedeoSapio/aws-ofi-nccl that referenced this pull request Oct 18, 2024
PR aws#543 moved the control message to its own dedicated endpoint on a single rail. As a result, the control message is no longer sent across all rails in round-robin fashion, which has impacted performance in some cases. This adds an environment variable to optionally enable the separate control message endpoint, defaulting to the old behavior of round-robining the control message across rails.

Signed-off-by: Amedeo Sapio <asapio@amazon.com>
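A minimal sketch of such an opt-in toggle is below. The variable name `OFI_NCCL_SEPARATE_CTRL_EP` is hypothetical; the commit message does not name the actual variable here.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toggle: unset or "0" keeps the old default behavior of
 * round-robining control messages across rails. */
static bool use_separate_ctrl_endpoint(void)
{
    const char *v = getenv("OFI_NCCL_SEPARATE_CTRL_EP"); /* hypothetical name */
    return v != NULL && strcmp(v, "0") != 0;
}
```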
rajachan pushed a commit that referenced this pull request Oct 26, 2024
PR #543 moved the control message to its own dedicated endpoint on a single rail, with its own CQ. To avoid the cost of polling an additional CQ, this commit makes the control endpoint share the CQ with rail 0.

Signed-off-by: Amedeo Sapio <asapio@amazon.com>
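A sketch of the CQ-sharing change, assuming `rail0_cq` is the existing data CQ on rail 0 (names are illustrative, not the plugin's): instead of opening a second CQ for the control endpoint, bind it to rail 0's CQ so that progressing rail 0 also progresses control traffic.

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Bind the control endpoint to rail 0's existing data CQ instead of a
 * dedicated control CQ, so there is no extra CQ to poll. */
static int bind_ctrl_to_rail0_cq(struct fid_ep *ctrl_ep,
                                 struct fid_cq *rail0_cq)
{
    return fi_ep_bind(ctrl_ep, &rail0_cq->fid, FI_TRANSMIT | FI_RECV);
}
```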
rajachan pushed a commit that referenced this pull request Oct 26, 2024
PR #543 moved the control message to its own dedicated endpoint on a single rail. As a result, the control message is no longer sent across all rails in round-robin fashion. This has impacted performance in some cases, so we want the option to still distribute the control messages across all rails.
This commit provides an environment variable to choose between a) using a single dedicated endpoint for control messages, sharing the CQ with the data endpoint on rail 0, or b) using as many additional endpoints for the control messages as there are rails, sharing the CQs with the data endpoints, and letting the scheduler round-robin the control messages across all rails.

Signed-off-by: Amedeo Sapio <asapio@amazon.com>
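The choice between the two modes could reduce to an endpoint-selection helper like the sketch below; the mode enum, the per-rail `ctrl_eps` array, and the round-robin counter are all hypothetical illustrations of the behavior described above.

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical modes matching options (a) and (b) above. */
enum ctrl_mode { CTRL_SINGLE_EP_RAIL0, CTRL_RR_ALL_RAILS };

/* Pick the endpoint for the next control message: always rail 0 in
 * mode (a); round-robin across per-rail endpoints in mode (b). */
static struct fid_ep *pick_ctrl_ep(enum ctrl_mode mode,
                                   struct fid_ep **ctrl_eps,
                                   int num_rails,
                                   unsigned int *rr_counter)
{
    if (mode == CTRL_SINGLE_EP_RAIL0)
        return ctrl_eps[0];
    return ctrl_eps[(*rr_counter)++ % (unsigned int)num_rails];
}
```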