Issue: NCCL WARN NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented #801

Open
xylian86 opened this issue Mar 5, 2025 · 0 comments

xylian86 commented Mar 5, 2025

Hi, I am using aws-ofi-nccl 1.6.0, and I hit this error when initializing NCCL through torch.

The trace below shows several warnings and errors.

 libfabric:507357:1741146011::cxi:fabric:cxip_nic_get_rgroup_vni():253<warn> : Failed to find valid default rgroup and vni for cxi0 

NCCL INFO NET/OFI Selected Provider is cxi (found 8 nics)
[0] NCCL INFO Using non-device net plugin version 0
[0] NCCL INFO Using network AWS Libfabric
[0] NCCL INFO DMA-BUF is available on GPU device 0
[0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.6.0
[0] NCCL INFO NET/OFI Selected Provider is cxi (found 8 nics)
[0] NCCL INFO Using non-device net plugin version 0
[0] NCCL INFO Using network AWS Libfabric
[0] NCCL INFO DMA-BUF is available on GPU device 0
[0] NCCL INFO comm 0xaaaaf6b29030 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 2901000 commId 0x8e6ef3d51f699a17 - Init START
[0] NCCL INFO comm 0xaaaaf6798e60 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 2901000 commId 0x8e6ef3d51f699a17 - Init START
[0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
[0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff,ffff0000,00000000,00000000,00000000,00000000
[0] NCCL INFO comm 0xaaaaf6b29030 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
[0] NCCL INFO Channel 00/02 :    0   1
[0] NCCL INFO Channel 01/02 :    0   1
[0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
 [0] NCCL INFO P2P Chunksize set to 131072
[0] NCCL INFO comm 0xaaaaf6798e60 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
[0] NCCL INFO P2P Chunksize set to 131072
[0] create_nccl_ofi_component:919 NCCL WARN NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented
[0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/AWS Libfabric/5
[0] create_nccl_ofi_component:919 NCCL WARN NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented
[rank1]:     work = group.barrier(opts=opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^
 [rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
[rank1]: ncclInternalError: Internal check failed.
[rank1]: Last error:
[rank1]: NET/OFI Couldn't open a fabric access domain. RC: -38, ERROR: Function not implemented
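
For reference, the `RC: -38` in the warning can be decoded with a few lines of Python. This is only a sketch, assuming the plugin is passing through libfabric's negative errno-style return code; the helper names `parse_ofi_rc` and `errno_name` are made up for illustration and are not part of NCCL or aws-ofi-nccl.

```python
import errno
import os
import re

# Matches the "RC: <n>, ERROR: <msg>" tail of the plugin's warning line.
RC_PATTERN = re.compile(r"RC:\s*(-?\d+),\s*ERROR:\s*(.+)$")


def parse_ofi_rc(line):
    """Return (rc, message) from an 'RC: <n>, ERROR: <msg>' log line, or None."""
    match = RC_PATTERN.search(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def errno_name(rc):
    """Map an errno-style return code to its symbolic name on this platform."""
    return errno.errorcode.get(abs(rc), "UNKNOWN")


# The warning line copied from the trace above.
line = (
    "[0] create_nccl_ofi_component:919 NCCL WARN NET/OFI Couldn't open a "
    "fabric access domain. RC: -38, ERROR: Function not implemented"
)

rc, message = parse_ofi_rc(line)
# Errno numbering is platform-specific, so run this on the affected node.
print(rc, errno_name(rc), os.strerror(abs(rc)))
```

On Linux, errno 38 is `ENOSYS`, which matches the "Function not implemented" text in the log. That suggests the domain-open call is being rejected by the cxi provider rather than failing partway through, though the trace alone does not confirm the cause.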

I'd appreciate any help or suggestions!
