Background:
Set up a GKE node pool with 2 H100 nodes (8 GPUs each) and the required NFS storage. Trying to serve the Llama 3 405B model after checkpoint conversion and building the TRT-LLM engine.
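For context, the node pool was created roughly along the lines below. The cluster name, zone, and pool name are illustrative placeholders rather than the exact commands used; on GCP an a3-highgpu-8g machine carries 8 H100 GPUs per node.

# Sketch only: cluster, zone, and pool names are placeholders
gcloud container node-pools create h100-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --machine-type a3-highgpu-8g \
  --accelerator type=nvidia-h100-80gb,count=8 \
  --num-nodes 2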
Environment:
Triton image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
Triton version: 0.16
mpirun (Open MPI): 4.1.5rc2
Issue:
Launching on the leader node with the command line:
python3 /var/run/models/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo=/var/run/models/tensorrtllm_backend/triton_model_repo --world_size 16
fails with the following error:
Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Server is assuming each node has 8 GPUs. To change this, use --gpu_per_node
Executing Leader (world size: 16)
Begin waiting for worker pods.
kubectl get pods -n default -l leaderworkerset.sigs.k8s.io/group-key=<redacted> --field-selector status.phase=Running -o jsonpath='{.items[*].metadata.name}'
'triton-trtllm-0 triton-trtllm-0-1'
2 of 2.0 workers ready.
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
[triton-trtllm-0:00238] Job UNKNOWN has launched
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,1]
[triton-trtllm-0:00238] sess_dir_finalize: proc session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: top session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,0]
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
exiting with status 1
Waiting 15 second before exiting.
On the non-leader (worker) node, the launch output is:
Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Worker paused awaiting SIGINT or SIGTERM.
Verified: mpirun is on the PATH.
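For completeness, the remaining causes ORTE lists (remote PATH/LD_LIBRARY_PATH, launch authority, connectivity back to mpirun) could be checked against the worker pod. The pod names below are taken from the kubectl output above; whether this setup launches workers over ssh is an assumption:

# Is orted (the daemon mpirun must start on the remote node) on the PATH in the worker pod?
kubectl exec triton-trtllm-0-1 -- which orted
kubectl exec triton-trtllm-0-1 -- bash -lc 'echo $PATH; echo $LD_LIBRARY_PATH'
# Can the leader resolve the worker? (The exact DNS name depends on the LeaderWorkerSet headless service.)
kubectl exec triton-trtllm-0 -- getent hosts triton-trtllm-0-1
# If the default rsh/ssh launcher is used, does passwordless ssh from leader to worker succeed non-interactively?
kubectl exec triton-trtllm-0 -- ssh -o BatchMode=yes triton-trtllm-0-1 hostname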
Question:
mpirun works fine on a single node. Is there any configuration that needs to be done when mpirun spans multiple nodes?
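Not an authoritative answer, but in general, when Open MPI spans nodes it needs (a) a way to start orted on the remote hosts (ssh by default, unless another launcher is configured), (b) the same Open MPI installation path on every node, and (c) the relevant environment forwarded to the remote daemons. Outside of launch_triton_server.py, and with placeholder host names, a generic multi-node invocation typically looks something like:

# Illustrative only: host names, slot counts, and the target binary are placeholders.
# --prefix should point at the Open MPI install inside the container (the path below is an assumption).
mpirun --allow-run-as-root -np 16 \
  --host leader-pod:8,worker-pod:8 \
  --prefix /opt/hpcx/ompi \
  -x PATH -x LD_LIBRARY_PATH \
  ./my_mpi_program

Whether launch_triton_server.py exposes these options directly is not certain; the underlying requirement is that the leader can start orted on the worker pod with a consistent PATH and LD_LIBRARY_PATH, which is what the ORTE message above is complaining about.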