-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rdma: Support early completion of recv() requests #797
base: master
Are you sure you want to change the base?
rdma: Support early completion of recv() requests #797
Conversation
src/nccl_ofi_api.c
Outdated
* Reset to NULL, unless provider's data progress model is FI_PROGRESS_AUTO, | ||
* for which plugin supports early completion mode for recv() with LL* protocols | ||
*/ | ||
if (*request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) { | ||
if (data_progress_auto == false && *request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) { | ||
*request = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flag should be early_completions
or something, with the behavior being determined by the progress mode of the underlying provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a flag to toggle this at runtime as well for debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure we need this check at all, looking at the code more. Whether or not we're doing early completion, the code below doesn't really care if *request
is set to something or not. Either it's the rdma transport, and it looks at it or it's the sendrecv transport and it blindly overwrites *request
. So I think this entire chunk of code should go away.
If it doesn't go away, 100% agree with Raghu's comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it, since now we've implemented early completion logic. We properly handle request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION
in rdma now, thus no need to reset to NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added an env var. Validated on both EFA-rdm and EFA-direct. when OFI_NCCL_EARLY_COMPL set to 1
EFA-rdm: ect-p5en-odcr-queue-dy-p5en48xlarge-2:601518:601606 [1] nccl_net_ofi_rdma_init:8171 NCCL WARN NET/OFI Early completion enablement failed due to provider data progress model is not FI_PROGRESS_AUTO
EFA-direct: Pass
src/nccl_ofi_rdma.c
Outdated
* If the provider data progress model is FI_PROGRESS_AUTO, we have an | ||
* opportunity to early-complete a recv() request with a hint from NCCL | ||
* v9 API when it uses LL* protocols. In that case, sender uses fi_write | ||
* to avoid cq entry insertion. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd definitely expand on this explanation, covering the test() behaviors in LL protocols that makes doing this okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this comment nor this code are correct. In post_rdma_write()
(this function), we should call fi_write()
if the remote asked for early completion and we should call fi_writedata()
if the remote side did not ask for early completion. The whole behavior around progress models needs to be in deciding whether or not the transport supports early completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are correct, reverted to the previous way of using a new CTRL type to communicate the sender behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's something missing in this patch. The receiver needs to specify whether to use early completion or not for each request. There's no change to request header data in this patch, so we're clearly not doing that.
and also reverted to the previous way of checking req pointer and setting a recv_completion_optional
for each request. (over-refactored the code...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's something missing in this patch. The receiver needs to specify whether to use early completion or not for each request. There's no change to request header data in this patch, so we're clearly not doing that.
src/nccl_ofi_api.c
Outdated
* Reset to NULL, unless provider's data progress model is FI_PROGRESS_AUTO, | ||
* for which plugin supports early completion mode for recv() with LL* protocols | ||
*/ | ||
if (*request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) { | ||
if (data_progress_auto == false && *request == (void *)NCCL_NET_OPTIONAL_RECV_COMPLETION) { | ||
*request = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure we need this check at all, looking at the code more. Whether or not we're doing early completion, the code below doesn't really care if *request
is set to something or not. Either it's the rdma transport, and it looks at it or it's the sendrecv transport and it blindly overwrites *request
. So I think this entire chunk of code should go away.
If it doesn't go away, 100% agree with Raghu's comments.
src/nccl_ofi_rdma.c
Outdated
* If the provider data progress model is FI_PROGRESS_AUTO, we have an | ||
* opportunity to early-complete a recv() request with a hint from NCCL | ||
* v9 API when it uses LL* protocols. In that case, sender uses fi_write | ||
* to avoid cq entry insertion. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this comment nor this code are correct. In post_rdma_write()
(this function), we should call fi_write()
if the remote asked for early completion and we should call fi_writedata()
if the remote side did not ask for early completion. The whole behavior around progress models needs to be in deciding whether or not the transport supports early completion.
46c478d
to
a635123
Compare
a635123
to
dc27609
Compare
include/nccl_ofi_param.h
Outdated
/* | ||
* 1 to enable early completion, 0 to disable it. It also requires the underlying provider works in FI_PROGRESS_AUTO data progress model | ||
*/ | ||
OFI_NCCL_PARAM_INT(early_completion_enabled, "EARLY_COMPLETION", 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should enable by default, at least on the release branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now, code doesn't differentiate default enabled vs user explicitly set to enable. So it'll error out with EFA-rdm branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, we need to fix that before release. Set the default to like -1
or something, and use that as a sentinel for "follow the progress model".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, updated in next revision
892a92f
to
ec05e81
Compare
ec05e81
to
68d1981
Compare
Description of changes:
Background:
When using LL (Low Latency) or LL128 protocols, NCCL sets the request pointer to NCCL_NET_OPTIONAL_RECV_COMPLETION in irecv() calls. This indicates that the plugin can complete a receiver request early without plugin explicitly polling the CQ to validate data arrival. This is achievable because NCCL itself following LL protocol semantics will validate data arrival by checking the flag bytes.
Plugin Optimization Details:
Requirements:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.