Hi, we are currently using a p5.48xlarge instance as a CI runner. We set up multiple runners on it with disjoint GPU subsets (0,1; 2,3; 4,5,6,7). We ran into the following NCCL error when running tests in parallel:
NCCL_DEBUG logs
2a8874e7d1e6:281:281 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:281:281 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ^docker,lo,veth_def_agent
2a8874e7d1e6:281:281 [0] NCCL INFO Bootstrap : Using eth0:172.20.0.2<0>
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
2a8874e7d1e6:281:281 NCCL CALL ncclGetUniqueId(0x7de6c0998b0f6a4)
2a8874e7d1e6:281:281 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
Traceback (most recent call last):
File "/__w/process_group.py", line 42, in __init__
torch.distributed.init_process_group(
File "/opt/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
func_return = func(*args, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
default_pg, _ = _new_process_group_helper(
File "/opt/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
Attribute arch of node cpu not found
2a8874e7d1e6:2196:2196 [1] NCCL INFO cudaDriverVersion 12040
2a8874e7d1e6:2196:2196 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2196 [1] NCCL INFO NCCL_SOCKET_IFNAME set to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2196 [1] NCCL INFO Bootstrap : Using eth0:172.20.0.2<0>
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
2a8874e7d1e6:2196:2196 [1] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fbd79e00000
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Using Libfabric version 1.21
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Using CUDA driver version 12040
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Configuring AWS-specific options
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting provider_filter to efa
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Internode latency set at 75.0 us
2a8874e7d1e6:2196:2275 [1] nccl_net_ofi_create_plugin:1067 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
2a8874e7d1e6:2196:2275 [1] NCCL INFO net.cc:56 -> 2
2a8874e7d1e6:2196:2275 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/IB : No device found.
2a8874e7d1e6:2196:2275 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/Socket : Using [0]eth0:172.20.0.2<0>
2a8874e7d1e6:2196:2275 [1] NCCL INFO Using non-device net plugin version 0
2a8874e7d1e6:2196:2275 [1] NCCL INFO Using network Socket
2a8874e7d1e6:2196:2275 [1] NCCL INFO ncclCommInitRank comm 0x55742386bb30 rank 1 nranks 2 cudaDev 1 nvmlDev 5 busId a8000 commId 0x7de6c0998b0f6a4 - Init START
2a8874e7d1e6:2196:2275 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
2a8874e7d1e6:2196:2275 [1] NCCL INFO Loading topology file /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
2a8874e7d1e6:2196:2275 [1] NCCL INFO Loading unnamed topology
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0'
2a8874e7d1e6:2196:2275 [1] graph/xml.h:101 NCCL WARN Attribute arch of node cpu not found
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:493 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:623 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:773 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO init.cc:1012 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO init.cc:1548 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
2a8874e7d1e6:2196:2196 [1] NCCL INFO group.cc:418 -> 3
2a8874e7d1e6:2196:2196 [1] NCCL INFO init.cc:1929 -> 3
The following two log lines are especially relevant:
2a8874e7d1e6:2196:2275 [1] nccl_net_ofi_create_plugin:1067 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
2a8874e7d1e6:2196:2275 [1] graph/xml.h:101 NCCL WARN Attribute arch of node cpu not found
We tried the following two ways of isolating GPUs; both trigger this problem:
CUDA_VISIBLE_DEVICES=GPU-xxx,GPU-xxx
docker run --gpus "device=GPU-xxx,GPU-xxx"
Do you have any clues on what's going on? Thanks!
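For completeness, the process-group setup in our tests boils down to roughly the following (a minimal sketch reconstructed from the traceback above; the env:// rendezvous and the local-rank device mapping are simplified assumptions, not our exact code):

import os
import torch
import torch.distributed as dist

# One process per GPU in the runner's subset; RANK / WORLD_SIZE / MASTER_ADDR /
# MASTER_PORT are assumed to be provided by the launcher.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = torch.device("cuda", local_rank)

# Passing device_id makes PyTorch eagerly connect the NCCL communicator for that
# device (eager_connect_single_device in the traceback), which is where NCCL
# parses the topology file and fails with "Attribute arch of node cpu not found".
dist.init_process_group(backend="nccl", device_id=device)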
Hi @abcdabcd987:
The error looks like it comes from NCCL trying to load the static topology profile /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml in ncclTopoAddCpu().
2a8874e7d1e6:2196:2275 [1] graph/xml.h:101 NCCL WARN Attribute arch of node cpu not found
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:493 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:623 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:773 -> 3
The static topology file had issues for some use cases. After plugin version 1.11, aws-ofi-nccl moved from a static config file to a dynamically generated topology file that only includes the devices NCCL will actually find. We recommend updating the aws-ofi-nccl plugin to a more recent version such as v1.13.2-aws and seeing if that resolves this problem.
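If it helps, here is a rough sketch of upgrading the plugin inside the runner image (the release tarball URL, install prefix, and configure paths are assumptions; please follow the aws-ofi-nccl README for your environment):

# The currently loaded plugin version shows up in the NCCL_DEBUG output
# ("NET/OFI Initializing aws-ofi-nccl 1.8.1-aws").
# Build and install a newer release, e.g. v1.13.2-aws (check the GitHub
# releases page for the exact asset name):
curl -LO https://github.com/aws/aws-ofi-nccl/releases/download/v1.13.2-aws/aws-ofi-nccl-1.13.2-aws.tar.gz
tar xf aws-ofi-nccl-1.13.2-aws.tar.gz && cd aws-ofi-nccl-1.13.2-aws
./configure --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --prefix=/opt/aws-ofi-nccl
make -j && make install
# Make sure NCCL loads the new libnccl-net.so instead of the old plugin,
# e.g. by putting it first on the library search path:
export LD_LIBRARY_PATH=/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH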