Hi, I am using aws-ofi.1.6, and I hit this issue when initializing NCCL with torch.
There are several warnings and errors, as shown in the following trace.
libfabric:507357:1741146011::cxi:fabric:cxip_nic_get_rgroup_vni():253<warn> : Failed to find valid default rgroup and vni for cxi0
NCCL INFO NET/OFI Selected Provider is cxi (found 8 nics)
[0] NCCL INFO Using non-device net plugin version 0
[0] NCCL INFO Using network AWS Libfabric
[0] NCCL INFO DMA-BUF is available on GPU device 0
[0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.6.0
[0] NCCL INFO NET/OFI Selected Provider is cxi (found 8 nics)
[0] NCCL INFO Using non-device net plugin version 0
[0] NCCL INFO Using network AWS Libfabric
[0] NCCL INFO DMA-BUF is available on GPU device 0
[0] NCCL INFO comm 0xaaaaf6b29030 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 2901000 commId 0x8e6ef3d51f699a17 - Init START
[0] NCCL INFO comm 0xaaaaf6798e60 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 2901000 commId 0x8e6ef3d51f699a17 - Init START
[0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
[0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
[0] NCCL INFO comm 0xaaaaf6b29030 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
[0] NCCL INFO Channel 00/02 : 0 1
[0] NCCL INFO Channel 01/02 : 0 1
[0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
[0] NCCL INFO P2P Chunksize set to 131072
[0] NCCL INFO comm 0xaaaaf6798e60 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
[0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
[0] NCCL INFO P2P Chunksize set to 131072
[0] create_nccl_ofi_component:919 NCCL WARN NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented
[0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/AWS Libfabric/5
[0] create_nccl_ofi_component:919 NCCL WARN NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented
[rank1]: work = group.barrier(opts=opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
[rank1]: ncclInternalError: Internal check failed.
[rank1]: Last error:
[rank1]: NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented
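For anyone triaging a longer log, the failing line can be pulled out with a short script. This is only a sketch: `parse_ofi_failure` and the regular expression are hypothetical helpers written against the exact wording of the WARN line above, not part of NCCL or aws-ofi-nccl. As far as I can tell, RC -38 is the negated Linux errno ENOSYS, which matches the "Function not implemented" text.

```python
import re

# Hypothetical helper, not part of NCCL or aws-ofi-nccl.
# The pattern assumes the WARN line format seen in the trace above.
OFI_WARN = re.compile(
    r"NCCL WARN NET/OFI (?P<what>.+?)\. RC: (?P<rc>-?\d+), ERROR: (?P<msg>.+)$"
)


def parse_ofi_failure(line):
    """Return (description, return_code, error_text), or None if the line does not match."""
    match = OFI_WARN.search(line)
    if match is None:
        return None
    return match.group("what"), int(match.group("rc")), match.group("msg").strip()


# The WARN line copied from the trace in this report.
line = (
    "[0] create_nccl_ofi_component:919 NCCL WARN NET/OFI Couldn't open a "
    "fabric access domain. RC: -38, ERROR: Function not implemented"
)
print(parse_ofi_failure(line))
```

Running it over every line of a captured log and keeping the non-`None` results gives the list of plugin-level failures without the surrounding INFO output.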
I would appreciate any help or suggestions!