Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdma: Support early completion of recv() requests #797

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yexiang-aws
Copy link
Contributor

@yexiang-aws yexiang-aws commented Feb 27, 2025

Description of changes:

Background:
When using LL (Low Latency) or LL128 protocols, NCCL sets the request pointer to NCCL_NET_OPTIONAL_RECV_COMPLETION in irecv() calls. This indicates that the plugin can complete a receiver request early without plugin explicitly polling the CQ to validate data arrival. This is achievable because NCCL itself following LL protocol semantics will validate data arrival by checking the flag bytes.

Plugin Optimization Details:

  1. Receiver Side:
    1. Marks request completion immediately after CTRL message send completion
    2. Does not wait for RDMA write operation completion
  2. Sender Side:
    1. Uses fi_write instead of fi_writedata, to eliminate unnecessary CQ entries on RX side

Requirements:

  • Eager msg mode is diabled: eager_max_size == -1
  • Provider must use FI_PROGRESS_AUTO data progress model

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@yexiang-aws yexiang-aws requested review from bwbarrett and a team as code owners February 27, 2025 01:31
Comment on lines 793 to 797
* Reset to NULL, unless provider's data progress model is FI_PROGRESS_AUTO,
* for which plugin supports early completion mode for recv() with LL* protocols
*/
if (*request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) {
if (data_progress_auto == false && *request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) {
*request = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag should be early_completions or something, with the behavior being determined by the progress mode of the underlying provider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a flag to toggle this at runtime as well for debugging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure we need this check at all, looking at the code more. Whether or not we're doing early completion, the code below doesn't really care if *request is set to something or not. Either it's the rdma transport, and it looks at it or it's the sendrecv transport and it blindly overwrites *request. So I think this entire chunk of code should go away.

If it doesn't go away, 100% agree with Raghu's comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it, since now we've implemented early completion logic. We properly handle request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION in rdma now, thus no need to reset to NULL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added an env var. Validated on both EFA-rdm and EFA-direct. when OFI_NCCL_EARLY_COMPL set to 1

EFA-rdm: ect-p5en-odcr-queue-dy-p5en48xlarge-2:601518:601606 [1] nccl_net_ofi_rdma_init:8171 NCCL WARN NET/OFI Early completion enablement failed due to provider data progress model is not FI_PROGRESS_AUTO
EFA-direct: Pass

Comment on lines 5521 to 5525
* If the provider data progress model is FI_PROGRESS_AUTO, we have an
* opportunity to early-complete a recv() request with a hint from NCCL
* v9 API when it uses LL* protocols. In that case, sender uses fi_write
* to avoid cq entry insertion.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely expand on this explanation, covering the test() behaviors in LL protocols that makes doing this okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment nor this code are correct. In post_rdma_write() (this function), we should call fi_write() if the remote asked for early completion and we should call fi_writedata() if the remote side did not ask for early completion. The whole behavior around progress models needs to be in deciding whether or not the transport supports early completion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are correct, reverted to the previous way of using a new CTRL type to communicate the sender behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's something missing in this patch. The receiver needs to specify whether to use early completion or not for each request. There's no change to request header data in this patch, so we're clearly not doing that.

and also reverted to the previous way of checking req pointer and setting a recv_completion_optional for each request. (over-refactored the code...)

Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's something missing in this patch. The receiver needs to specify whether to use early completion or not for each request. There's no change to request header data in this patch, so we're clearly not doing that.

Comment on lines 793 to 797
* Reset to NULL, unless provider's data progress model is FI_PROGRESS_AUTO,
* for which plugin supports early completion mode for recv() with LL* protocols
*/
if (*request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) {
if (data_progress_auto == false && *request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) {
*request = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure we need this check at all, looking at the code more. Whether or not we're doing early completion, the code below doesn't really care if *request is set to something or not. Either it's the rdma transport, and it looks at it or it's the sendrecv transport and it blindly overwrites *request. So I think this entire chunk of code should go away.

If it doesn't go away, 100% agree with Raghu's comments.

Comment on lines 5521 to 5525
* If the provider data progress model is FI_PROGRESS_AUTO, we have an
* opportunity to early-complete a recv() request with a hint from NCCL
* v9 API when it uses LL* protocols. In that case, sender uses fi_write
* to avoid cq entry insertion.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment nor this code are correct. In post_rdma_write() (this function), we should call fi_write() if the remote asked for early completion and we should call fi_writedata() if the remote side did not ask for early completion. The whole behavior around progress models needs to be in deciding whether or not the transport supports early completion.

@yexiang-aws yexiang-aws force-pushed the feature/LL128_early_completion branch 2 times, most recently from 46c478d to a635123 Compare February 28, 2025 02:09
@yexiang-aws yexiang-aws force-pushed the feature/LL128_early_completion branch from a635123 to dc27609 Compare February 28, 2025 16:59
bwbarrett
bwbarrett previously approved these changes Feb 28, 2025
/*
* 1 to enable early completion, 0 to disable it. It also requires the underlying provider works in FI_PROGRESS_AUTO data progress model
*/
OFI_NCCL_PARAM_INT(early_completion_enabled, "EARLY_COMPLETION", 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should enable by default, at least on the release branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now, code doesn't differentiate default enabled vs user explicitly set to enable. So it'll error out with EFA-rdm branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, we need to fix that before release. Set the default to like -1 or something, and use that as a sentinel for "follow the progress model".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, updated in next revision

@yexiang-aws yexiang-aws force-pushed the feature/LL128_early_completion branch 3 times, most recently from 892a92f to ec05e81 Compare February 28, 2025 17:37
@yexiang-aws yexiang-aws force-pushed the feature/LL128_early_completion branch from ec05e81 to 68d1981 Compare February 28, 2025 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants