
aws-ofi-nccl initialization failed when using a subset of GPUs on p5.48xlarge #785

Open
abcdabcd987 opened this issue Feb 17, 2025 · 1 comment


@abcdabcd987

Hi, we are currently using a p5.48xlarge instance as a CI runner. We set up multiple runners on it with disjoint GPU subsets ({0,1}, {2,3}, {4,5,6,7}). We ran into the following NCCL error when running tests in parallel:

NCCL_DEBUG logs
2a8874e7d1e6:281:281 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:281:281 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ^docker,lo,veth_def_agent
2a8874e7d1e6:281:281 [0] NCCL INFO Bootstrap : Using eth0:172.20.0.2<0>
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
2a8874e7d1e6:281:281 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
2a8874e7d1e6:281:281 NCCL CALL ncclGetUniqueId(0x7de6c0998b0f6a4)
2a8874e7d1e6:281:281 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
Traceback (most recent call last):
  File "/__w/process_group.py", line 42, in __init__
    torch.distributed.init_process_group(
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/opt/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
Attribute arch of node cpu not found
2a8874e7d1e6:2196:2196 [1] NCCL INFO cudaDriverVersion 12040
2a8874e7d1e6:2196:2196 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2196 [1] NCCL INFO NCCL_SOCKET_IFNAME set to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2196 [1] NCCL INFO Bootstrap : Using eth0:172.20.0.2<0>
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v7)
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
2a8874e7d1e6:2196:2196 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
2a8874e7d1e6:2196:2196 [1] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fbd79e00000
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.8.1-aws
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Using Libfabric version 1.21
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Using CUDA driver version 12040
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Configuring AWS-specific options
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting provider_filter to efa
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 for Hopper GPUs
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Running on p5.48xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/OFI Internode latency set at 75.0 us

2a8874e7d1e6:2196:2275 [1] nccl_net_ofi_create_plugin:1067 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
2a8874e7d1e6:2196:2275 [1] NCCL INFO net.cc:56 -> 2
2a8874e7d1e6:2196:2275 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/IB : No device found.
2a8874e7d1e6:2196:2275 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo,veth_def_agent
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/Socket : Using [0]eth0:172.20.0.2<0>
2a8874e7d1e6:2196:2275 [1] NCCL INFO Using non-device net plugin version 0
2a8874e7d1e6:2196:2275 [1] NCCL INFO Using network Socket
2a8874e7d1e6:2196:2275 [1] NCCL INFO ncclCommInitRank comm 0x55742386bb30 rank 1 nranks 2 cudaDev 1 nvmlDev 5 busId a8000 commId 0x7de6c0998b0f6a4 - Init START
2a8874e7d1e6:2196:2275 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
2a8874e7d1e6:2196:2275 [1] NCCL INFO Loading topology file /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml
2a8874e7d1e6:2196:2275 [1] NCCL INFO Loading unnamed topology
2a8874e7d1e6:2196:2275 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0'

2a8874e7d1e6:2196:2275 [1] graph/xml.h:101 NCCL WARN Attribute arch of node cpu not found
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:493 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:623 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:773 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO init.cc:1012 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO init.cc:1548 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
2a8874e7d1e6:2196:2196 [1] NCCL INFO group.cc:418 -> 3
2a8874e7d1e6:2196:2196 [1] NCCL INFO init.cc:1929 -> 3

The following two log lines are especially relevant:

2a8874e7d1e6:2196:2275 [1] nccl_net_ofi_create_plugin:1067 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
2a8874e7d1e6:2196:2275 [1] graph/xml.h:101 NCCL WARN Attribute arch of node cpu not found

We tried the following two approaches to GPU isolation; both trigger the problem:

  1. CUDA_VISIBLE_DEVICES=GPU-xxx,GPU-xxx
  2. docker run --gpus '"device=GPU-xxx,GPU-xxx"'

Do you have any clues on what's going on? Thanks!

@mozarhua
Contributor

mozarhua commented Feb 21, 2025

Hi @abcdabcd987:
The error looks like it comes from NCCL trying to load the static topology file /opt/amazon/efa/share/aws-ofi-nccl/xml/p5.48xl-topo.xml in ncclTopoAddCpu():

2a8874e7d1e6:2196:2275 [1] graph/xml.h:101 NCCL WARN Attribute arch of node cpu not found
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:493 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:623 -> 3
2a8874e7d1e6:2196:2275 [1] NCCL INFO graph/topo.cc:773 -> 3

The static topology file had issues for some use cases. As of plugin version 1.11, aws-ofi-nccl moved from a static config file to a dynamically generated topology file that only includes the devices NCCL will find. I recommend updating the aws-ofi-nccl plugin to a more recent version, such as v1.13.2-aws, and seeing whether that resolves the problem.
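To illustrate the failure mode, here is a minimal sketch in plain Python (not NCCL's actual C parser) of the required-attribute lookup that fails inside ncclTopoAddCpu(): NCCL reads each `<cpu>` node from the topology XML and errors out when the `arch` attribute it requires is absent. The XML fragment below is hypothetical, merely shaped like a topology file, and the function name is mine, not NCCL's.

```python
# Sketch of the check behind "Attribute arch of node cpu not found".
# This is NOT NCCL's code; it just mimics the required-attribute lookup.
import xml.etree.ElementTree as ET

def find_cpu_nodes_missing_arch(topo_xml: str) -> list:
    """Return all <cpu> nodes in a topology XML that lack an 'arch' attribute."""
    root = ET.fromstring(topo_xml)
    return [cpu for cpu in root.iter("cpu") if "arch" not in cpu.attrib]

# Hypothetical fragment shaped like an NCCL topology file: the first cpu
# node carries 'arch', the second does not.
example = """
<system version="1">
  <cpu numaid="0" arch="x86_64" vendor="GenuineIntel"/>
  <cpu numaid="1"/>
</system>
"""

for node in find_cpu_nodes_missing_arch(example):
    # NCCL aborts topology loading (error 3) at the first such node.
    print("cpu node missing 'arch':", node.attrib)
```

A dynamically generated topology file, as produced by newer plugin versions, should always populate these attributes for the devices actually visible to the process, which is why the upgrade is the recommended fix.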
