The easiest way to set up and use Triton Inference Server is with a Docker image. There are three main steps involved in the setup.
- Install Docker specific to your operating system from here.
- Docker can also be installed using a convenience script. Please read about the potential risks and limitations here. Installation at the time of testing was done using the repository here.
- Install the NVIDIA Container Toolkit for GPU support from here (not available on Windows).
- Pull the latest Triton Docker image using the following command
$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3
Replace <xx.yy> with the latest release of Triton, which can be found here.
Testing at the time was done on r21.05 using the command
$ docker pull nvcr.io/nvidia/tritonserver:21.05-py3
Here you can find some post-installation steps for Docker. (These are not mandatory; some are just for convenience.)
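To confirm that Docker and the NVIDIA Container Toolkit are working before moving on, you can run a couple of throwaway containers. The CUDA image tag below is only an example; substitute any CUDA base image tag currently available on Docker Hub.
$ docker run --rm hello-world
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
The second command should print the same GPU table that nvidia-smi prints on the host; if it fails, the NVIDIA Container Toolkit is not set up correctly.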
The model repository is the directory where you place the models to be served by Triton. Set up the repository and fetch the model files using the inception_setup.sh script.
./inception_setup.sh
For the unet model, which is based on this model repository, run the following
./unet_setup.sh
You can read briefly about setting up custom model repositories here.
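For reference, Triton expects each model in the repository to sit in its own sub-directory containing a config.pbtxt and a numbered version folder with the actual model file; the names below are placeholders (the setup scripts above create the real directories), and for some model types the config.pbtxt can be auto-generated when --strict-model-config=false is passed, as in the run command below.
model_repository/
  <model_name>/
    config.pbtxt
    1/
      <model_file>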
Start the Triton server with the model repository mounted into the container:
$ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:21.05-py3 tritonserver --model-repository=/models --strict-model-config=false
To run without GPUs, remove the --gpus=1 flag. For multiple GPUs, change the --gpus flag to the number of GPUs or use --gpus=all.
All models in the model repository should be loaded, with their version number and status displayed (STATUS should be ready). NOTE: It is possible that using $(pwd) may not load all the models; in that case, use the absolute path of the model_repository folder.
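For example, if the repository was cloned to /home/user/triton-demo (a hypothetical path), the mount would look like this:
$ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/user/triton-demo/model_repository:/models nvcr.io/nvidia/tritonserver:21.05-py3 tritonserver --model-repository=/models --strict-model-config=false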
Use Triton’s ready endpoint to verify that the server and the models are ready for inference. From the host system use curl to access the HTTP endpoint that indicates server status.
$ curl -v localhost:8000/v2/health/ready
...
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
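If you only want the status code, for example inside a script, curl can print it directly; this is plain curl usage, not a separate Triton endpoint.
$ curl -s -o /dev/null -w "%{http_code}" localhost:8000/v2/health/ready
200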
You can also check the availability of individual models at v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
For example, to check the unet model
$ curl -v localhost:8000/v2/models/unet/versions/1/ready
We can test inference with the inception model using our sample frontend for image classification.
You can also refer to API_requests.md for example GET and POST requests that are used in the frontend; these can also be tested using Postman.
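As a rough illustration of what such requests look like (the input name, shape, datatype, and data below are placeholders in the spirit of <xx.yy> above; take the real values from the model's metadata response), a GET request returns the model metadata and a POST request to the infer endpoint runs inference:
$ curl localhost:8000/v2/models/unet
$ curl -X POST localhost:8000/v2/models/unet/versions/1/infer -H "Content-Type: application/json" -d '{"inputs": [{"name": "<input_name>", "shape": [<dims>], "datatype": "<datatype>", "data": [<flattened_input_values>]}]}'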
Use the DALI backend to implement end-to-end inference on Triton itself, including image preprocessing, model inference, and post-processing of results, by building an ensemble of models. An example of this can be found here.
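For orientation, an ensemble is itself declared as a model whose config.pbtxt wires the output of one step into the input of the next. The sketch below uses hypothetical model and tensor names, not ones from this repository, and only illustrates the general shape of such a configuration.
name: "ensemble_example"
platform: "ensemble"
max_batch_size: 1
input [
  { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "SCORES", data_type: TYPE_FP32, dims: [ 1001 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali_preprocess"
      model_version: -1
      input_map { key: "DALI_INPUT_0", value: "RAW_IMAGE" }
      output_map { key: "DALI_OUTPUT_0", value: "preprocessed_image" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "classifier_input", value: "preprocessed_image" }
      output_map { key: "classifier_output", value: "SCORES" }
    }
  ]
}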