Triton Server YOLO Models

This repository serves as an example of deploying the YOLO models on Triton Server for performance and testing purposes. It includes support for applications developed using Nvidia DeepStream.
Currently, only YOLOv7, YOLOv7 QAT, YOLOv8, YOLOv9 and YOLOv9 QAT are supported.

For YOLOv8 Models we are using custom plugin YOLO_NMS_TRT, the End2End implementation is not available on official Repo. Only on https://github.com/levipereira/ultralytics

Triton Client Repository

For testing and evaluating YOLO models, you can utilize the repository triton-client-yolo

Evaluation Test on TensorRT

Evaluation test was perfomed using this client

Models Details

YOLOv9

Model ONNX > TensorRT	Test Size	AP^val	AP₅₀^val	AP₇₅^val
YOLOv9-C (FP16)	640	52.9%	70.1%	57.7%
YOLOv9-C ReLU (FP16)	640	51.7%	68.8%	56.3%
YOLOv9-E (FP16)	640	55.4%	72.6%	60.3%
YOLOv9-C QAT	640	52.7%	69.8%	57.5%
YOLOv9-C ReLU QAT	640	51.6%	69.7%	56.3%
YOLOv9-E QAT	640	55.3%	72.4%	60.2%

YOLOv8

Model ONNX > TensorRT	Test Size	AP^val	AP₅₀^val	AP₇₅^val
YOLOv8n (FP16)	640	37.3%	52.6%	40.5%
YOLOv8s (FP16)	640	44.9%	61.6%	48.6%
YOLOv8m (FP16)	640	50.1%	67%	54.6%
YOLOv8l (FP16)	640	52.7%	69.6%	57.4%
YOLOv8x (FP16)	640	53.7%	70.7%	58.6%

YOLOv7

Model ONNX > TensorRT	Test Size	AP^val	AP₅₀^val	AP₇₅^val
YOLOv7 (FP16)	640	51.1%	69.3%	55.6%
YOLOv7x (FP16)	640	52.9%	70.8%	57.4%
YOLOv7-QAT (INT8)	640	50.9%	69.2%	55.5%
YOLOv7x-QAT (INT8)	640	52.5%	70.6%	57.3%

Description

This repository utilizes exported models using ONNX. It offers two types of ONNX models

Model with Dynamic Shape and Dynamic Batch Size with End2End using Efficient NMS or YOLO_NMS_TRT plugin
It offers two types of configuration models:

A model optimized for evaluation (--topk-all 300 --iou-thres 0.7 --conf-thres 0.001).
A model optimized for inference (--topk-all 100 --iou-thres 0.45 --conf-thres 0.25).

Model with Dynamic Shapes and Dynamic Batching without End2End.

The Non-Maximum Suppression must be handled by Client

Detailed Models can be found here

Starting NVIDIA Triton Inference Server Container with Docker

Prerequisites

Docker with support NVIDIA Container Toolkit must be installed.
NVIDIA GPU(s) should be available.

Usage

git clone https://github.com/levipereira/triton-server-yolo.git
cd triton-server-yolo
# Start Docker container
bash ./start-container-triton-server.sh

Install compiled `libnvinfer_plugin` with YoloNMS

Download and install compiled libnvinfer_plugin

cd TensorRTPlugin
./patch_libnvinfer.sh --download
cd ..

Or build from source the libnvinfer_plugin
TensorRTPlugin

Starting Triton Inference Server

Inside Docker Container use bash ./start-triton-server.sh

Script Usage

This script is used to build TensorRT engines and start Triton-Server for YOLO models.

cd /apps
bash ./start-triton-server.sh

Script Options

--models: Specify the YOLO model name(s). Choose one or more with comma separation. Available options:
yolov9-c
yolov9-c-relu
yolov9-c-qat
yolov9-c-relu-qat
yolov9-e
yolov9-e-qat
yolov8n
yolov8s
yolov8m
yolov8l
yolov8x
yolov7
yolov7-qat
yolov7x
yolov7x-qat
--model_mode: Use Model ONNX optimized for EVALUATION or INFERENCE. Choose from 'eval' or 'infer'.
--plugin: Options: 'efficientNMS' or 'yoloNMS' or 'none'.
--opt_batch_size: Specify the optimal batch size for TensorRT engines.
--max_batch_size: Specify the maximum batch size for TensorRT engines.
--instance_group: Specify the number of TensorRT engine instances loaded per model in the Triton Server.
--force: Rebuild TensorRT engines even if they already exist.
--reset_all: Purge all existing TensorRT engines and their respective configurations.

Script Flow:

Checks for the existence of YOLOv7/YOLOv9 ONNX model files.
Downloads ONNX models if they do not exist.
Converts YOLOv7/YOLOv9 ONNX model to TensorRT engine with FP16 precision.
Updates configurations in the Triton Server config files.
Starts Triton Inference Server.

Important Note: Building TensorRT engines for each model can take more than 15 minutes. If TensorRT engines already exist, this script reuses them. Users can utilize the --force flag to trigger a fresh rebuild of the models.

Starting Triton Server using Models for Evaluation Purposes

example:

cd /apps
bash ./start-triton-server.sh  \
--models yolov9-c,yolov7 \
--model_mode eval \
--plugin efficientNMS \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1

Starting Triton Server using Models for Inference Purposes

example:

cd /apps
bash ./start-triton-server.sh  \
--models yolov9-c,yolov7 \
--model_mode infer \
--plugin efficientNMS \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1

Starting Triton Server using models for Inference Purposes without Efficient NMS.

example:

cd /apps
bash ./start-triton-server.sh  \
--models yolov9-c,yolov7 \
--model_mode infer \
--plugin none \
--opt_batch_size 4 \
--max_batch_size 4 \
--instance_group 1

After running script, you can verify the availability of the model by checking this output::

+----------+---------+--------+
| yolov7   | 1       | READY  |
| yolov7x  | 1       | READY  |
| yolov9-c | 1       | READY  |
| yolov9-e | 1       | READY  |
+----------+---------+--------+

Manual ONNX YOLOv7/v9 exports

Exporting YOLOv7 Series from PyTorch to ONNX With Efficient NMS plugin

This repo does not export pytorch models to ONNX.
You can use the Official Yolov7 Repository

python export.py --weights yolov7.pt \
  --grid \
  --end2end \
  --dynamic-batch \
  --simplify \
  --topk-all 100 \
  --iou-thres 0.65 \
  --conf-thres 0.35 \
  --img-size 640 640

Exporting YOLOv8 Series from PyTorch YOLOv8 to ONNX With YOLO_NMS plugin

This repo does not export pytorch models to ONNX.

from ultralytics import YOLO
model = YOLO("yolov8n.pt") 
model.export(format="onnx_trt")

Exporting YOLOv9 Series from PyTorch YOLOv9 to ONNX With Efficient NMS plugin

This repo does not export pytorch models to ONNX.

python3 export.py \
   --weights ./yolov9-c.pt \
   --imgsz 640 \
   --topk-all 100 \
   --iou-thres 0.65 \
   --conf-thres 0.35 \
   --include onnx_end2end

Additional Configurations

See Triton Model Configuration Documentation for more info.

Example of Yolo Configuration

Note:

The values of 100 in the det_boxes/det_scores/det_classes arrays represent the topk-all.
The setting "max_queue_delay_microseconds: 30000" is optimized for a 30fps input rate.

name: "yolov9-c"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "num_dets"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "det_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
  },
  {
    name: "det_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "det_classes"
    data_type: TYPE_INT32
    dims: [ 100 ]
  }
]
instance_group [
    {
      count: 4
      kind: KIND_GPU
      gpus: [ 0 ]
    }
]
version_policy: { latest: { num_versions: 1}}
dynamic_batching {
  max_queue_delay_microseconds: 3000
}

Running Performance with Model Analyzer

See Triton Model Analyzer Documentation for more info.

On the Host Machine:

docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:23.08-py3-sdk /bin/bash

$ ./install/bin/perf_analyzer -m yolov7 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 1
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 7524
    Throughput: 417.972 infer/sec
    Avg latency: 2391 usec (standard deviation 1235 usec)
    p50 latency: 2362 usec
    p90 latency: 2460 usec
    p95 latency: 2484 usec
    p99 latency: 2669 usec
    Avg gRPC time: 2386 usec ((un)marshal request/response 4 usec + response wait 2382 usec)
  Server:
    Inference count: 7524
    Execution count: 7524
    Successful request count: 7524
    Avg request latency: 2280 usec (overhead 30 usec + queue 18 usec + compute input 972 usec + compute infer 1223 usec + compute output 36 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 417.972 infer/sec, latency 2391 usec

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
TensorRTPlugin		TensorRTPlugin
models_config		models_config
models_onnx		models_onnx
perfomance		perfomance
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
start-container-triton-server.sh		start-container-triton-server.sh
start-triton-server.sh		start-triton-server.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Triton Server YOLO Models

Triton Client Repository

Evaluation Test on TensorRT

YOLOv9

YOLOv8

YOLOv7

Description

Starting NVIDIA Triton Inference Server Container with Docker

Prerequisites

Usage

Install compiled `libnvinfer_plugin` with YoloNMS

Starting Triton Inference Server

Script Usage

Script Options

Script Flow:

Starting Triton Server using Models for Evaluation Purposes

Starting Triton Server using Models for Inference Purposes

Starting Triton Server using models for Inference Purposes without Efficient NMS.

Manual ONNX YOLOv7/v9 exports

Exporting YOLOv7 Series from PyTorch to ONNX With Efficient NMS plugin

Exporting YOLOv8 Series from PyTorch YOLOv8 to ONNX With YOLO_NMS plugin

Exporting YOLOv9 Series from PyTorch YOLOv9 to ONNX With Efficient NMS plugin

Additional Configurations

Running Performance with Model Analyzer

About

Releases 1

Packages

Contributors 2

Languages

License

levipereira/triton-server-yolo

Folders and files

Latest commit

History

Repository files navigation

Triton Server YOLO Models

Triton Client Repository

Evaluation Test on TensorRT

YOLOv9

YOLOv8

YOLOv7

Description

Starting NVIDIA Triton Inference Server Container with Docker

Prerequisites

Usage

Install compiled libnvinfer_plugin with YoloNMS

Starting Triton Inference Server

Script Usage

Script Options

Script Flow:

Starting Triton Server using Models for Evaluation Purposes

Starting Triton Server using Models for Inference Purposes

Starting Triton Server using models for Inference Purposes without Efficient NMS.

Manual ONNX YOLOv7/v9 exports

Exporting YOLOv7 Series from PyTorch to ONNX With Efficient NMS plugin

Exporting YOLOv8 Series from PyTorch YOLOv8 to ONNX With YOLO_NMS plugin

Exporting YOLOv9 Series from PyTorch YOLOv9 to ONNX With Efficient NMS plugin

Additional Configurations

Running Performance with Model Analyzer

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Install compiled `libnvinfer_plugin` with YoloNMS

Packages