Commit 12ed337 ("rdma: switch to untagged send/recv") removed the use of tagged entries for the NCCL RDMA protocol. Even though the RDMA protocol now uses untagged send/recv operations, the CQ format attribute is still initialized from the capability flags reported by the provider.
OPX sets FI_TAGGED in the provider's capabilities, so the CQ format is set to FI_CQ_FORMAT_TAGGED and OPX then fills the CQ with fi_cq_tagged_entry structures. Should the places that were changed from fi_cq_tagged_entry to fi_cq_data_entry go back to fi_cq_tagged_entry, so that they match the entry type the CQ is actually filled with?
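To make the concern concrete, here is a minimal sketch of what happens when a buffer filled with tagged entries is walked using the data-entry stride. The `demo_*` structs are stand-ins that mirror the field layout of libfabric's fi_cq_data_entry and fi_cq_tagged_entry; they and the helper functions are hypothetical names used only for illustration, not code from the plugin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins mirroring the field layout of libfabric's fi_cq_data_entry and
 * fi_cq_tagged_entry. Real code would include <rdma/fi_eq.h> instead. */
struct demo_cq_data_entry {
	void     *op_context;
	uint64_t  flags;
	size_t    len;
	void     *buf;
	uint64_t  data;
};

struct demo_cq_tagged_entry {
	void     *op_context;
	uint64_t  flags;
	size_t    len;
	void     *buf;
	uint64_t  data;
	uint64_t  tag;   /* extra trailing field: makes each entry larger */
};

/* Reads entry `idx` from a raw CQ buffer, assuming the data-entry stride. */
struct demo_cq_data_entry demo_read_as_data(const void *cq_buf, size_t idx)
{
	struct demo_cq_data_entry out;
	memcpy(&out, (const char *)cq_buf + idx * sizeof(out), sizeof(out));
	return out;
}

/* Fills two tagged entries (as a provider using the tagged format would),
 * reads them back with the data-entry stride, and returns 1 only if the
 * second entry's flags survive the round trip. */
int demo_second_entry_matches(void)
{
	struct demo_cq_tagged_entry written[2];
	struct demo_cq_data_entry got;

	/* The two layouts share a prefix but differ in size. */
	assert(sizeof(struct demo_cq_tagged_entry) > sizeof(struct demo_cq_data_entry));

	memset(written, 0, sizeof(written));
	written[0].flags = 0x1;
	written[0].tag   = 0xAAAA;
	written[1].flags = 0x2;
	written[1].tag   = 0xBBBB;

	/* Entry 0 lines up because the leading fields are identical;
	 * entry 1 is read from the wrong offset. */
	got = demo_read_as_data(written, 1);
	return got.flags == written[1].flags;
}
```

Because the first five fields are identical, the first completion reads correctly under either type; the mismatch only shows up from the second entry onward, when more than one completion is read per call.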
Interestingly enough, this exact problem came up during the review for this code: #361 (comment). The conclusion was that FI_TAGGED is a "primary capability", which means it should not be enabled unless specifically requested by the application.
The fi_getinfo(3) man page says: "Capabilities may be grouped into three general categories: primary, secondary, and primary modifiers. Primary capabilities must explicitly be requested by an application, and a provider must enable support for only those primary capabilities which were selected. Primary modifiers are used to limit a primary capability, such as restricting an endpoint to being send-only. If no modifiers are specified for an applicable capability, all relevant modifiers are assumed. See above definitions for details."
My reading is that, since the plugin does not request FI_TAGGED capability, the OPX provider should not enable it.
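One way to picture this reading is a CQ format choice driven by the capabilities the application requested, rather than by everything the provider reports. The sketch below is only an illustration of that idea under stated assumptions: the `DEMO_*` constants and `demo_pick_cq_format` are hypothetical stand-ins, not libfabric definitions and not the plugin's actual logic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for libfabric capability bits and CQ formats.
 * Real code would use FI_MSG, FI_TAGGED, FI_CQ_FORMAT_DATA and
 * FI_CQ_FORMAT_TAGGED from the libfabric headers. */
#define DEMO_FI_MSG    (1ULL << 1)
#define DEMO_FI_TAGGED (1ULL << 3)

enum demo_cq_format {
	DEMO_CQ_FORMAT_DATA,
	DEMO_CQ_FORMAT_TAGGED
};

/* Picks the CQ format from capabilities that were both requested by the
 * application and reported by the provider. A primary capability the
 * provider reports without being asked (e.g. tagged) is ignored. */
enum demo_cq_format demo_pick_cq_format(uint64_t requested_caps,
					uint64_t provider_caps)
{
	uint64_t effective = requested_caps & provider_caps;

	return (effective & DEMO_FI_TAGGED) ? DEMO_CQ_FORMAT_TAGGED
					    : DEMO_CQ_FORMAT_DATA;
}

/* Usage: the application asks only for untagged messaging, the provider
 * also reports tagged support, and the data format is still chosen. */
void demo_pick_cq_format_example(void)
{
	assert(demo_pick_cq_format(DEMO_FI_MSG, DEMO_FI_MSG | DEMO_FI_TAGGED)
	       == DEMO_CQ_FORMAT_DATA);
}
```

With a rule like this, the completion entry type read by the application would follow from what it requested, regardless of extra capabilities a provider advertises.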
Code referenced in the issue, all at commit d459367:
CQ format initialization from the provider's capability flags: aws-ofi-nccl/src/nccl_ofi_ofiutils.c, lines 266 to 270
Places changed from fi_cq_tagged_entry to fi_cq_data_entry: aws-ofi-nccl/src/nccl_ofi_rdma.c, lines 1113, 1281, 1389, and 1740
Alternatively, should cq_attr.format (aws-ofi-nccl/src/nccl_ofi_ofiutils.c, line 266) be set based on a different condition?