Triton Inference Server

In this blog post, I share my first thoughts about Triton after some time spent playing around with it, describe its most important features, and explain when and how to use it. In the next blog post, I will describe tips and tricks, present a cheat sheet, and provide a code sample for benchmarking different model configurations for computer vision models. Let's get started!

What is it?

Triton is a high-performance machine learning inference runtime used in many machine learning deployment tools, such as Seldon Core, KServe, MLFlow, and BentoML. Its strength lies in its optimized performance, thanks to its C++ implementation, and a rich set of features that make the tool versatile and suitable for almost every application. Triton supports multiple frameworks, such as NVIDIA TensorRT, TensorFlow, PyTorch, OpenVINO, and ONNX, as well as custom C++ and Python backends, integrates with Kubernetes, and provides metrics for service monitoring out of the box, which makes it easy to use in production environments.

[Image source: https://docs.nvidia.com]

Triton is available in the JetPack SDK, which is supported by NVIDIA Jetson computers suitable for AI on the edge. Thanks to Triton's C API, one can make real-time inferences without network overhead.

Triton also comes to the rescue when deploying large language models. You can read more about this on this blog.

First contact

There are various tutorials available in the official repository. Mind that they contain several minor bugs or just need some adjustments; I've created a PR with fixes. Once these are fixed, the code runs properly and provides an easy way to familiarise yourself with Triton's main concepts, such as Triton architecture, concurrent model execution, model repository structure, the model configuration file, and optimization techniques.

The tools provided by Triton, perf_analyzer and model_analyzer, can automatically determine the optimal parameters for deployment on specific hardware, which is very useful when deploying models on a production server.

Unfortunately, Triton is not an easy tool to use. There are many exceptions in configuring models, conditions that cannot be combined, and so on, and the server throws vague error messages when you try to run such a configuration. In the sections below, I will share my observations of these exceptions and the rules I followed when experimenting with Triton.

Model configuration

Below is a minimal model configuration file for a model with 2 input tensors and 1 output tensor. There are a few mandatory fields: name, which has to match the model's directory name in the model repository provided to Triton; platform, which defines the type of model runtime; and input and output, which define the data going into and out of the model.

Possible platforms are onnxruntime_onnx, tensorrt_plan, pytorch_libtorch, tensorflow_savedmodel, and tensorflow_graphdef. You have to use the model file extension matching the platform: .onnx, .plan, .pt, etc.

Data types are defined in the documentation. The field dims defines the dimensions of a tensor. For example, for a model processing ImageNet images, the dimensions would be dims: [3, 224, 224] for the input and dims: [1000] for the output.

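A minimal configuration along these lines might look as follows. This is only a sketch; the model name, tensor names, data type, and shapes are placeholders, not values from a specific model.

name: "my_model"                  # must match the directory name in the model repository
platform: "onnxruntime_onnx"      # runtime used to execute the model
input [
  {
    name: "image"                 # first input tensor
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  },
  {
    name: "metadata"              # second input tensor
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "logits"                # single output tensor
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

The file is saved as config.pbtxt, next to the model's version directories.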

Important note: inference on batches is possible by specifying the max_batch_size value, but one has to ensure that the exported model is capable of processing such data. For a TensorRT model, one has to export the model with shapes defined for the minimum and maximum accepted batch sizes. I will elaborate on this problem in the next post about Triton, so stay tuned!
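
As an illustration, a TensorRT engine with a dynamic batch dimension can be built with trtexec roughly like this; the input tensor name, shapes, and file names are placeholders and have to match your exported ONNX model.

# Build an engine accepting batch sizes from 1 to 16 for an input
# named "image" with shape [batch, 3, 224, 224].
trtexec --onnx=model.onnx \
        --saveEngine=model.plan \
        --minShapes=image:1x3x224x224 \
        --optShapes=image:8x3x224x224 \
        --maxShapes=image:16x3x224x224

The corresponding Triton configuration would then set max_batch_size to at most 16 and specify dims without the batch dimension.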

Instance groups

This is a feature that allows the creation of multiple instances of a model to parallelize the inference process and, hopefully, increase throughput and decrease latency while utilizing more resources.
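
For example, a snippet along these lines in the model configuration runs two copies of the model on the first GPU; the count and device index below are only an illustration.

# Two model instances on GPU 0, so two requests can be processed at once.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]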

Concurrent model execution

From the documentation: “The Triton architecture allows multiple models and/or multiple instances of the same model to execute in parallel on the same system.” All arriving requests are scheduled for execution in queues. Depending on the instance_group configuration, which defines the number of parallel model instances, requests may be computed in parallel on both GPU and CPU.

Dynamic batching

To improve the throughput of a model, one can aggregate data into larger batches and then run model inference on a batch instead of multiple smaller data inputs. That’s exactly what Triton’s dynamic batching feature does. There are a few parameters to configure while using this feature: preferred batch size and max queue delay. The first one is quite self-explanatory; the second one defines how long Triton waits for incoming requests to form a dynamic batch.
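
In the model configuration this could look roughly as follows; the values are illustrative, and the model must also have max_batch_size set accordingly.

# Gather requests into batches of 4 or 8 when possible, waiting
# at most 100 microseconds for additional requests to arrive.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}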

From Triton’s documentation: “By default, the requests can be dynamically batched only if each input has the same shape across the requests. In order to exploit dynamic batching for cases where input shapes often vary, the client would need to pad the input tensors in the requests to the same shape.” For models whose input shapes vary, there is a ragged batching feature. You can learn more about it here.

Priorities

Triton provides capabilities to create multiple queues with different priority levels. By default, there is only one queue. Priorities are assigned for each deployed model separately, through a configuration file.
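
Priority queues are configured as part of dynamic batching in the model configuration; a sketch with illustrative values could look like this.

# Two priority levels; requests that do not set a priority use level 2.
dynamic_batching {
  priority_levels: 2
  default_priority_level: 2
}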

Cache

You can optimize the use of resources by enabling the response cache. It’s enabled for each model separately, in the configuration file, but requires that Triton is started with the --cache-config local,size=SIZE flag. You can read more details about this feature in the documentation.
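
In the model configuration, enabling the cache is a one-liner; this is a sketch assuming the server was started with a local cache as described above.

# Cache responses for repeated identical requests to this model.
response_cache {
  enable: true
}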

Model warmup

Some models may suffer from a cold start after deployment. This is a situation in which the first few requests processed by the model take noticeably longer to complete. To prevent this issue, Triton provides functionality to dry-run model inference before marking the model as READY.
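
Warmup is configured per model; the sketch below assumes an input tensor named "image" and uses random data, so the names and shapes are placeholders.

# Run one synthetic request before the model is marked READY.
model_warmup [
  {
    name: "warmup_request"
    batch_size: 1
    inputs {
      key: "image"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        random_data: true
      }
    }
  }
]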

Model repository

To start the Triton server with a given model repository, run the command below:

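A typical invocation using the official NVIDIA container looks roughly like this; the image tag and the path to the model repository are placeholders you need to adjust.

# Expose HTTP (8000), gRPC (8001), and metrics (8002) ports and
# mount the local model repository into the container.
docker run --gpus=all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models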

Instead of a path to a local model repository, you can provide a path to cloud storage, such as S3, Google Cloud Storage, or Azure Storage.
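
For example, with an S3 bucket only the path changes, assuming the appropriate cloud credentials are available to the server; the bucket name below is hypothetical.

tritonserver --model-repository=s3://my-bucket/model_repository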

The model repository directory structure must follow this pattern:

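A sketch of such a repository with two hypothetical models could look like this; the numbered subdirectories are model versions.

model_repository/
  image_classifier/
    config.pbtxt
    1/
      model.onnx
    2/
      model.onnx
  object_detector/
    config.pbtxt
    1/
      model.plan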

Triton supports multiple versions of deployed models.

Metrics

Triton provides Prometheus metrics with GPU and request statistics by default.
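
By default the metrics are exposed on port 8002; you can inspect them, for example, with curl.

curl localhost:8002/metrics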

Business Logic Scripting

To speed up model inference, one can use the business logic scripting (BLS) feature to separate pre- and post-processing steps, written in Python/C++/Java, from core model inference. The core model can be optimized, for instance, by exporting it as a TensorRT model with proper configuration. The whole pipeline can be established using Triton’s ensemble model, which defines a pipeline of operations with input and output parameters. In some cases, it can ensure better utilization of deployed models by sharing core models between different ensemble models.
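
A sketch of an ensemble configuration chaining a hypothetical preprocessing model and a core classifier could look like this; all model and tensor names here are placeholders.

name: "pipeline"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "raw_image" data_type: TYPE_UINT8 dims: [ -1 ] }
]
output [
  { name: "label" data_type: TYPE_STRING dims: [ 1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"       # e.g. a Python backend model
      model_version: -1
      input_map { key: "raw_image" value: "raw_image" }
      output_map { key: "normalized_image" value: "preprocessed" }
    },
    {
      model_name: "classifier"       # e.g. an optimized TensorRT model
      model_version: -1
      input_map { key: "image" value: "preprocessed" }
      output_map { key: "label" value: "label" }
    }
  ]
}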

Should I use it?

Triton should be used in production environments where all its features can be utilized. It shines with its ability to gather requests dynamically into batches and scale models horizontally with multiple instances. Those features pay off only under a sufficiently high request load. Comparing a model running in a native framework, such as PyTorch, with the same model running in Triton, the latter may be slower at low traffic volumes due to the overhead of queuing, batching, and other processing steps. However, Triton will be more effective for large request workloads, for which dynamic batching and scaling of the number of model instances come into play.

Since Triton is so great, should I use it in every situation? If you are not frightened by a drop in throughput of a few percentage points, then Triton is a good choice for you. Deployment with it will allow you to scale your solution seamlessly to production with minimal configuration changes.

How to use it?

Triton comes with two powerful tools to optimize a model’s configuration for a given hardware and software environment: perf_analyzer and model_analyzer. The optimal configuration can differ between machines, and therefore you should always tune your configuration for the target production server.

First, export your models into ONNX and TensorRT formats with different optimizations, such as dynamic batches, half-precision, and quantization. Next, run model_analyzer with the specified search space of parameters. It may take a long time, so make yourself a cup of coffee or go on a lunch break. After this step finishes, you can generate a summary report comparing the different configurations. Lastly, you can select and apply the best configuration and run the production Triton service.
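
The commands look roughly like this; the model name and paths are placeholders, and the exact flags may differ between versions of the tools.

# Measure latency and throughput of a deployed model under increasing load.
perf_analyzer -m my_model --concurrency-range 1:8

# Profile a search space of configurations for a model in the repository.
model-analyzer profile \
  --model-repository /path/to/model_repository \
  --profile-models my_model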

Summary

Triton is a production-grade tool for model deployment thanks to its high performance and optimized utilization of resources. The tools built around Triton, such as perf_analyzer and model_analyzer, are very useful for quickly testing different configurations and selecting the best one for deployment. It’s not so easy to use at first, but as you get familiar with its concepts and common problems, it becomes a handy tool in the hands of a machine learning or DevOps engineer. Triton is used as the underlying runtime in other deployment tools as part of larger MLOps pipelines, which makes the deployment process even simpler. Therefore it’s good to know how to handle Triton’s configuration. In the next blog post, I will show tips and tricks and present code samples, so stay tuned!
