My Motivation
In this new series, I’m going to dive into the world of the NVIDIA Triton Inference Server¹. If you’re an ML engineer like me, you might have found it tricky to get started with, thanks to the lack of clear examples and tutorials out there. That’s exactly why I’m writing this – to help you get the hang of it and use it for your own projects.
What We’ll Be Covering
Consider this a mini-course where we’ll walk through everything from the basics to more advanced topics. We’ll start with setting up your first “echo” server and understanding the `config.pbtxt` file, then work our way up to deploying a deep learning model and benchmarking model ensembles.
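To give you a taste of what’s coming, here’s a minimal sketch of a `config.pbtxt` for a hypothetical “echo” model served by Triton’s Python backend. The model name, tensor names, and shapes are placeholders I’m using for illustration; we’ll unpack each field properly later in the series:

```protobuf
name: "echo"
backend: "python"
max_batch_size: 0       # batching disabled for this toy example

input [
  {
    name: "INPUT0"       # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ -1 ]         # variable-length 1-D input
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
```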
Here’s a sneak peek at what you’ll learn:
- The `config.pbtxt` file
- How to set up batches
- How to pass custom parameters
- How to tell your model to use either a GPU or a CPU
- How to create a model ensemble to run several models in a sequence
- How to set up a model repository (see the layout sketch after this list)
- How to use Python scripts as a model
- How to use `onnx` models, both with and without batches
- How to pass custom data between the client and server
- How to make asynchronous requests to your model
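As a quick preview of the model repository item above: Triton expects one directory per model, with numbered version subdirectories and the `config.pbtxt` alongside them. The “echo” name below is just the placeholder from earlier:

```
model_repository/
└── echo/
    ├── config.pbtxt
    └── 1/
        └── model.py   # Python backend; an ONNX model would use model.onnx instead
```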
The Tools of the Trade
Throughout this mini-course, we’ll be using a few key tools:
- Python: We’ll use this for running models on the server and making requests from the client.
- OpenCV: This will help us manage image and frame processing.
- NumPy: You’ll see that this is a crucial part of the Triton ecosystem, as it’s the go-to for passing data to and from the server (a client sketch follows this list).
- The RF-DETR² model from Roboflow will be our example for inference.
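Since NumPy is how data crosses the wire, here’s a minimal sketch of a client request using the `tritonclient` package. The model name “echo” and tensor names “INPUT0”/“OUTPUT0” are the placeholders from the config sketch above, not a real deployed model:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be listening on localhost:8000 (HTTP).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Data goes in (and comes back) as NumPy arrays.
data = np.arange(4, dtype=np.float32)

# "echo", "INPUT0", and "OUTPUT0" are placeholder names for illustration.
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

response = client.infer(model_name="echo", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))  # the result comes back as a NumPy array
```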
To follow along with the GPU examples, you’ll need access to an NVIDIA GPU and Docker. Docker is a must because it’s the most straightforward way to run the Triton server.
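For reference, launching the server typically looks something like the command below. The image tag is just an example; pick one that matches your NVIDIA driver and CUDA setup, and point the volume mount at your own model repository:

```bash
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.08-py3 \
  tritonserver --model-repository=/models
```

Ports 8000, 8001, and 8002 are Triton’s HTTP, gRPC, and metrics endpoints, respectively.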
To get the RF-DETR model running inside the server, you’ll need to build your own custom Docker image that includes both the Triton server and the RF-DETR model with all its dependencies. Don’t worry, I’ll walk you through exactly how to do that!
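As a rough preview (we’ll go through this step properly later), such a Dockerfile might look like the sketch below. The `rfdetr` package name comes from the Roboflow repository, and the extra OpenCV dependency is my assumption; treat both as placeholders until we get to that post:

```dockerfile
# Start from the official Triton image; the tag is an example, match it to your setup.
FROM nvcr.io/nvidia/tritonserver:24.08-py3

# Install RF-DETR and its dependencies into the image's Python environment.
# "rfdetr" is assumed from the Roboflow repo; adjust if the package name differs.
RUN pip install --no-cache-dir rfdetr opencv-python-headless
```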
References
1. NVIDIA Triton Inference Server: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
2. RF-DETR: SOTA Real-Time Object Detection Model: https://github.com/roboflow/rf-detr