Error launching model on Triton on multigpu nodes #698

Open
sujituk opened this issue Feb 7, 2025 · 0 comments

sujituk commented Feb 7, 2025

Background:
Set up a GKE node pool with 2 H100 nodes (8 GPUs each) and the required NFS storage. Trying to serve the Llama 3 405B model after checkpoint conversion and building the TRT-LLM engine.
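For context, a checkpoint conversion and engine build for this setup would look roughly like the following (a sketch only: the model and output paths are placeholders, and --tp_size 16 is assumed to match the --world_size 16 used at launch):

# Convert the HF checkpoint to TensorRT-LLM format, sharded 16 ways across the two nodes.
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /var/run/models/llama-3.1-405b \
    --output_dir /var/run/models/ckpt_tp16 \
    --dtype bfloat16 \
    --tp_size 16

# Build the serving engine from the converted checkpoint.
trtllm-build \
    --checkpoint_dir /var/run/models/ckpt_tp16 \
    --output_dir /var/run/models/engine_tp16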

Environment:
Triton image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
Triton version: 0.16
mpirun (Open MPI) 4.1.5rc2

Issue:
Launching on the leader node with the command line:
python3 /var/run/models/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo=/var/run/models/tensorrtllm_backend/triton_model_repo --world_size 16

It fails with the following error:


Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Server is assuming each node has 8 GPUs. To change this, use --gpu_per_node
Executing Leader (world size: 16)
Begin waiting for worker pods.

kubectl get pods -n default -l leaderworkerset.sigs.k8s.io/group-key=<redacted> --field-selector status.phase=Running -o jsonpath='{.items[*].metadata.name}'
'triton-trtllm-0 triton-trtllm-0-1'
2 of 2.0 workers ready.

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

  • not finding the required libraries and/or binaries on
    one or more nodes. Please check your PATH and LD_LIBRARY_PATH
    settings, or configure OMPI with --enable-orterun-prefix-by-default

  • lack of authority to execute on one or more specified nodes.
    Please verify your allocation and authorities.

  • the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
    Please check with your sys admin to determine the correct location to use.

  • compilation of the orted with dynamic libraries when static are required
    (e.g., on Cray). Please check your configure cmd line and consider using
    one of the contrib/platform definitions for your system type.

  • an inability to create a connection back to mpirun due to a
    lack of common network interfaces and/or no route found between
    them. Please check network connectivity (including firewalls
    and network routing requirements).


[triton-trtllm-0:00238] Job UNKNOWN has launched
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,1]
[triton-trtllm-0:00238] sess_dir_finalize: proc session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: jobfam session dir does not exist
[triton-trtllm-0:00238] sess_dir_finalize: top session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
[triton-trtllm-0:00238] [[41321,0],0] Releasing job data for [41321,0]
[triton-trtllm-0:00238] sess_dir_cleanup: job session dir does not exist
[triton-trtllm-0:00238] sess_dir_cleanup: top session dir does not exist
exiting with status 1
Waiting 15 second before exiting.

Launching on a non-leader node reports:
Triton model repository is at:'/var/run/models/tensorrtllm_backend/triton_model_repo'
Worker paused awaiting SIGINT or SIGTERM.

Verified: mpirun is on the PATH.
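To isolate things, a minimal cross-node smoke test like the one below should show whether ORTE can start a daemon on the second pod at all, independent of Triton (a sketch: the pod names are taken from the kubectl output above and are assumed to resolve as hostnames inside the cluster):

# One process per pod; success prints both pod hostnames.
mpirun -np 2 --host triton-trtllm-0,triton-trtllm-0-1 hostname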

Question:
mpirun works fine on a single node. Is there any configuration that needs to be done when mpirun spans multiple nodes?
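For example, does mpirun need the remote Open MPI prefix and environment forwarded explicitly, along these lines (a sketch: /opt/ompi is a placeholder for the actual install prefix inside the container, and the server command is elided)?

# Point orted at the remote install and forward PATH/LD_LIBRARY_PATH to both pods.
mpirun --prefix /opt/ompi \
    -x PATH -x LD_LIBRARY_PATH \
    -np 16 --host triton-trtllm-0:8,triton-trtllm-0-1:8 \
    <server command>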
