This repository serves as an example of deploying YOLO models on Triton Inference Server for performance and testing purposes. It includes support for applications developed with NVIDIA DeepStream.
Currently, only YOLOv7, YOLOv7 QAT, YOLOv8, YOLOv9 and YOLOv9 QAT are supported.
For YOLOv8 models we use the custom YOLO_NMS_TRT plugin; the End2End implementation is not available in the official repo, only at https://github.com/levipereira/ultralytics
For testing and evaluating YOLO models, you can use the triton-client-yolo repository.
The evaluation below was performed using this client.
Model (ONNX → TensorRT) | Test Size | APval | AP50val | AP75val |
---|---|---|---|---|
YOLOv9-C (FP16) | 640 | 52.9% | 70.1% | 57.7% |
YOLOv9-C ReLU (FP16) | 640 | 51.7% | 68.8% | 56.3% |
YOLOv9-E (FP16) | 640 | 55.4% | 72.6% | 60.3% |
YOLOv9-C QAT | 640 | 52.7% | 69.8% | 57.5% |
YOLOv9-C ReLU QAT | 640 | 51.6% | 69.7% | 56.3% |
YOLOv9-E QAT | 640 | 55.3% | 72.4% | 60.2% |
Model (ONNX → TensorRT) | Test Size | APval | AP50val | AP75val |
---|---|---|---|---|
YOLOv8n (FP16) | 640 | 37.3% | 52.6% | 40.5% |
YOLOv8s (FP16) | 640 | 44.9% | 61.6% | 48.6% |
YOLOv8m (FP16) | 640 | 50.1% | 67.0% | 54.6% |
YOLOv8l (FP16) | 640 | 52.7% | 69.6% | 57.4% |
YOLOv8x (FP16) | 640 | 53.7% | 70.7% | 58.6% |
Model (ONNX → TensorRT) | Test Size | APval | AP50val | AP75val |
---|---|---|---|---|
YOLOv7 (FP16) | 640 | 51.1% | 69.3% | 55.6% |
YOLOv7x (FP16) | 640 | 52.9% | 70.8% | 57.4% |
YOLOv7-QAT (INT8) | 640 | 50.9% | 69.2% | 55.5% |
YOLOv7x-QAT (INT8) | 640 | 52.5% | 70.6% | 57.3% |
This repository uses models exported to ONNX. It offers two types of ONNX models:
- Model with Dynamic Shapes and Dynamic Batch Size, End2End, using the Efficient NMS or YOLO_NMS_TRT plugin. Two configurations are offered:
  - A model optimized for evaluation (--topk-all 300 --iou-thres 0.7 --conf-thres 0.001).
  - A model optimized for inference (--topk-all 100 --iou-thres 0.45 --conf-thres 0.25).
- Model with Dynamic Shapes and Dynamic Batching, without End2End.
  - Non-Maximum Suppression must be handled by the client (see the sketch below).
Detailed model information can be found here.
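For the non-End2End models, the client must run confidence filtering and NMS itself. Below is a minimal NumPy sketch of that post-processing; the (num_boxes, 4 + num_classes) output layout and xyxy box format assumed here are illustrative only and must be matched to the actual ONNX export.

```python
import numpy as np

def nms(boxes, scores, iou_thres=0.45):
    """Greedy NMS on xyxy boxes; returns the indices of the boxes to keep."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # process highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current box against the remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou < iou_thres]    # drop overlapping boxes
    return keep

def postprocess(pred, conf_thres=0.25, iou_thres=0.45):
    """pred: (num_boxes, 4 + num_classes) raw output; this layout is an assumption."""
    boxes, cls_scores = pred[:, :4], pred[:, 4:]
    scores = cls_scores.max(axis=1)
    classes = cls_scores.argmax(axis=1)
    mask = scores > conf_thres
    boxes, scores, classes = boxes[mask], scores[mask], classes[mask]
    keep = nms(boxes, scores, iou_thres)
    return boxes[keep], scores[keep], classes[keep]
```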
- Docker with NVIDIA Container Toolkit support must be installed.
- NVIDIA GPU(s) should be available.
git clone https://github.com/levipereira/triton-server-yolo.git
cd triton-server-yolo
# Start Docker container
bash ./start-container-triton-server.sh
- Download and install the precompiled libnvinfer_plugin:
cd TensorRTPlugin
./patch_libnvinfer.sh --download
cd ..
- Or build libnvinfer_plugin from source; see the TensorRTPlugin directory for instructions.
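As an optional sanity check, you can confirm that the patched library registers the NMS plugins with TensorRT. This sketch assumes the tensorrt Python bindings are available in your container and that the custom plugin registers under the name YOLO_NMS_TRT:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
# Register every plugin shipped in libnvinfer_plugin, including patched ones
trt.init_libnvinfer_plugins(logger, "")

registered = {c.name for c in trt.get_plugin_registry().plugin_creator_list}
print("EfficientNMS_TRT registered:", "EfficientNMS_TRT" in registered)
print("YOLO_NMS_TRT registered:   ", "YOLO_NMS_TRT" in registered)  # plugin name is an assumption
```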
Inside the Docker container, use bash ./start-triton-server.sh.
This script is used to build TensorRT engines and start Triton-Server for YOLO models.
cd /apps
bash ./start-triton-server.sh
- --models: Specify the YOLO model name(s). Choose one or more, comma-separated. Available options: yolov9-c, yolov9-c-relu, yolov9-c-qat, yolov9-c-relu-qat, yolov9-e, yolov9-e-qat, yolov8n, yolov8s, yolov8m, yolov8l, yolov8x, yolov7, yolov7-qat, yolov7x, yolov7x-qat.
- --model_mode: Use the ONNX model optimized for evaluation or inference. Choose 'eval' or 'infer'.
- --plugin: Choose 'efficientNMS', 'yoloNMS', or 'none'.
- --opt_batch_size: Specify the optimal batch size for TensorRT engines.
- --max_batch_size: Specify the maximum batch size for TensorRT engines.
- --instance_group: Specify the number of TensorRT engine instances loaded per model in the Triton Server.
- --force: Rebuild TensorRT engines even if they already exist.
- --reset_all: Purge all existing TensorRT engines and their respective configurations.
- Checks for the existence of YOLOv7/YOLOv9 ONNX model files.
- Downloads ONNX models if they do not exist.
- Converts YOLOv7/YOLOv9 ONNX models to TensorRT engines with FP16 precision (see the sketch after this list).
- Updates configurations in the Triton Server config files.
- Starts Triton Inference Server.
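The FP16 engine build performed by the script is conceptually similar to the TensorRT Python sketch below; the file names, input tensor name, and shape ranges are illustrative, and the script itself may drive the build differently (e.g. via trtexec):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the exported ONNX model (file name is an example)
with open("yolov9-c-end2end.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 precision, as the script uses

# Dynamic batch: min 1, opt --opt_batch_size, max --max_batch_size
profile = builder.create_optimization_profile()
profile.set_shape("images", (1, 3, 640, 640), (4, 3, 640, 640), (8, 3, 640, 640))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```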
Important Note: Building TensorRT engines for each model can take more than 15 minutes. If TensorRT engines already exist, this script reuses them. Use the --force flag to trigger a fresh rebuild of the models.
Example: evaluation-optimized models with the EfficientNMS plugin:
cd /apps
bash ./start-triton-server.sh \
--models yolov9-c,yolov7 \
--model_mode eval \
--plugin efficientNMS \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1
Example: inference-optimized models with the EfficientNMS plugin:
cd /apps
bash ./start-triton-server.sh \
--models yolov9-c,yolov7 \
--model_mode infer \
--plugin efficientNMS \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1
Example: inference-optimized models without an NMS plugin (NMS handled by the client):
cd /apps
bash ./start-triton-server.sh \
--models yolov9-c,yolov7 \
--model_mode infer \
--plugin none \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1
After running the script, you can verify that the models are available by checking this output:
+----------+---------+--------+
| Model    | Version | Status |
+----------+---------+--------+
| yolov7   | 1       | READY  |
| yolov7x  | 1       | READY  |
| yolov9-c | 1       | READY  |
| yolov9-e | 1       | READY  |
+----------+---------+--------+
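You can also check readiness programmatically with the Triton client library (pip install tritonclient[all]); port 8000 is Triton's default HTTP port, and the model name below is just an example:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("Server live: ", client.is_server_live())
print("Server ready:", client.is_server_ready())
print("yolov7 ready:", client.is_model_ready("yolov7"))
```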
This repo does not export PyTorch models to ONNX. You can use the official YOLOv7 repository:
python export.py --weights yolov7.pt \
--grid \
--end2end \
--dynamic-batch \
--simplify \
--topk-all 100 \
--iou-thres 0.65 \
--conf-thres 0.35 \
--img-size 640 640
This repo does not export PyTorch models to ONNX. For YOLOv8 End2End export, use the custom ultralytics fork (https://github.com/levipereira/ultralytics):
from ultralytics import YOLO
model = YOLO("yolov8n.pt")
model.export(format="onnx_trt")
This repo does not export PyTorch models to ONNX. For YOLOv9, export with:
python3 export.py \
--weights ./yolov9-c.pt \
--imgsz 640 \
--topk-all 100 \
--iou-thres 0.65 \
--conf-thres 0.35 \
--include onnx_end2end
See Triton Model Configuration Documentation for more info.
Example of a YOLO model configuration.
Note:
- The value 100 in the det_boxes/det_scores/det_classes dimensions corresponds to the --topk-all setting.
- The setting max_queue_delay_microseconds: 30000 is optimized for a 30 fps input rate.
name: "yolov9-c"
platform: "tensorrt_plan"
max_batch_size: 8
input [
{
name: "images"
data_type: TYPE_FP32
dims: [ 3, 640, 640 ]
}
]
output [
{
name: "num_dets"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "det_boxes"
data_type: TYPE_FP32
dims: [ 100, 4 ]
},
{
name: "det_scores"
data_type: TYPE_FP32
dims: [ 100 ]
},
{
name: "det_classes"
data_type: TYPE_INT32
dims: [ 100 ]
}
]
instance_group [
{
count: 4
kind: KIND_GPU
gpus: [ 0 ]
}
]
version_policy: { latest: { num_versions: 1}}
dynamic_batching {
max_queue_delay_microseconds: 30000
}
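Given this configuration, a minimal Python gRPC client could look like the sketch below; the preprocessing is a placeholder (a real client must letterbox and normalize exactly as the exporter expects), and the box coordinate format depends on the exported model:

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")  # Triton's default gRPC port

# Placeholder input: replace with a letterboxed 640x640 RGB image,
# scaled to [0, 1] and transposed to NCHW.
images = np.zeros((1, 3, 640, 640), dtype=np.float32)

inputs = [grpcclient.InferInput("images", list(images.shape), "FP32")]
inputs[0].set_data_from_numpy(images)
outputs = [grpcclient.InferRequestedOutput(name)
           for name in ("num_dets", "det_boxes", "det_scores", "det_classes")]

result = client.infer(model_name="yolov9-c", inputs=inputs, outputs=outputs)

num = int(result.as_numpy("num_dets")[0][0])
boxes = result.as_numpy("det_boxes")[0][:num]    # (num, 4); coordinate format is model-dependent
scores = result.as_numpy("det_scores")[0][:num]
classes = result.as_numpy("det_classes")[0][:num]
print(f"{num} detections")
```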
See Triton Model Analyzer Documentation for more info.
On the Host Machine:
docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:23.08-py3-sdk /bin/bash
$ ./install/bin/perf_analyzer -m yolov7 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 1
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 7524
Throughput: 417.972 infer/sec
Avg latency: 2391 usec (standard deviation 1235 usec)
p50 latency: 2362 usec
p90 latency: 2460 usec
p95 latency: 2484 usec
p99 latency: 2669 usec
Avg gRPC time: 2386 usec ((un)marshal request/response 4 usec + response wait 2382 usec)
Server:
Inference count: 7524
Execution count: 7524
Successful request count: 7524
Avg request latency: 2280 usec (overhead 30 usec + queue 18 usec + compute input 972 usec + compute infer 1223 usec + compute output 36 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 417.972 infer/sec, latency 2391 usec