My Motivation
In this new series, I’m going to dive into the world of the NVIDIA Triton Inference Server¹. If you’re an ML engineer like me, you might have found it tricky to get started with, thanks to the lack of clear examples and tutorials out there. That’s exactly why I’m writing this – to help you get the hang of it and use it for your own projects.
What We’ll Be Covering
Consider this a mini-course where we’ll walk through everything from the basics to more advanced topics. We’ll start with setting up your first “echo” server and understanding the `config.pbtxt` file, then work our way up to deploying a deep learning model and benchmarking model ensembles.
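To give you a taste of what’s coming, here’s a minimal sketch of a `config.pbtxt` for a hypothetical “echo” model served by Triton’s Python backend. The model name, tensor names, and shapes are placeholders I’m using for illustration; we’ll unpack each field properly later in the series:

```protobuf
name: "echo"
backend: "python"
max_batch_size: 0       # batching disabled for this toy example

input [
  {
    name: "INPUT0"       # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ -1 ]         # variable-length 1-D input
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
```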
Here’s a sneak peek at what you’ll learn:
- The `config.pbtxt` file
- How to set up batches
- How to pass custom parameters
- How to tell your model to use either a GPU or a CPU
- How to create a model ensemble to run several models in a sequence
- How to set up a model repository (see the layout sketch after this list)
- How to use Python scripts as a model
- How to use `onnx` models, both with and without batches
- How to pass custom data between the client and server
- How to make asynchronous requests to your model
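As a quick preview of the model repository item above: Triton expects one directory per model, with numbered version subdirectories and the `config.pbtxt` alongside them. The “echo” name below is just the placeholder from earlier:

```
model_repository/
└── echo/
    ├── config.pbtxt
    └── 1/
        └── model.py   # Python backend; an ONNX model would use model.onnx instead
```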
The Tools of the Trade
Throughout this mini-course, we’ll be using a few key tools:
- Python: We’ll use this for running models on the server and making requests from the client.
- OpenCV: This will help us manage image and frame processing.
- NumPy: You’ll see that this is a crucial part of the Triton ecosystem, as it’s the go-to for passing data to and from the server (a client sketch follows this list).
- The RF-DETR² model from Roboflow will be our example for inference.
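Since NumPy is how data crosses the wire, here’s a minimal sketch of a client request using the `tritonclient` package. The model name “echo” and tensor names “INPUT0”/“OUTPUT0” are the placeholders from the config sketch above, not a real deployed model:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be listening on localhost:8000 (HTTP).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Data goes in (and comes back) as NumPy arrays.
data = np.arange(4, dtype=np.float32)

# "echo", "INPUT0", and "OUTPUT0" are placeholder names for illustration.
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

response = client.infer(model_name="echo", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))  # the result comes back as a NumPy array
```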
To follow along with the GPU examples, you’ll need access to an NVIDIA GPU and Docker. Docker is a must because it’s the most straightforward way to run the Triton server.
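For reference, launching the server typically looks something like the command below. The image tag is just an example; pick one that matches your NVIDIA driver and CUDA setup, and point the volume mount at your own model repository:

```bash
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.08-py3 \
  tritonserver --model-repository=/models
```

Ports 8000, 8001, and 8002 are Triton’s HTTP, gRPC, and metrics endpoints, respectively.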
To get the RF-DETR model running inside the server, you’ll need to build your own custom Docker image that includes both the Triton server and the RF-DETR model with all its dependencies. Don’t worry, I’ll walk you through exactly how to do that!
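As a rough preview (we’ll go through this step properly later), such a Dockerfile might look like the sketch below. The `rfdetr` package name comes from the Roboflow repository, and the extra OpenCV dependency is my assumption; treat both as placeholders until we get to that post:

```dockerfile
# Start from the official Triton image; the tag is an example, match it to your setup.
FROM nvcr.io/nvidia/tritonserver:24.08-py3

# Install RF-DETR and its dependencies into the image's Python environment.
# "rfdetr" is assumed from the Roboflow repo; adjust if the package name differs.
RUN pip install --no-cache-dir rfdetr opencv-python-headless
```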
References
1. NVIDIA Triton Inference Server: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
2. RF-DETR: SOTA Real-Time Object Detection Model: https://github.com/roboflow/rf-detr