NMS layers are much slower on TensorRT than on PyTorch (roughly 44% of PyTorch's speed), and I'm looking for any possible workaround. This seems to be acknowledged as a known issue in the TensorRT release notes here:
"A performance regression is expected for TensorRT 10.x with respect to TensorRT 8.6 for networks with operations that involve data-dependent shapes, such as non-max suppression or non-zero operations."
Is there any possible workaround, or is a fix planned for a specific future version? I am specifically using these layers inside a FasterRCNN network (as implemented in torchvision here). I observe this network to be much slower whether running with a single image or with 4 images:
Single-image inference latency: 7.8 ms on PyTorch, 13.3 ms on TensorRT
4-image inference latency: 22.8 ms on PyTorch, 53.5 ms on TensorRT
When I run this network with per-layer profiling, I see that the NonMaxSuppression layers account for 75%+ of the overall inference time. I have verified this on TensorRT 10.0 and 10.6. I have tested using ONNX opset 11 and opset 17.
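For reference, the relative-performance figure can be recomputed from the latencies quoted above. This short script is illustrative only; the numbers are the ones reported in this issue (lower latency means faster):

```python
# Batch size -> inference latency in milliseconds, as reported above.
pytorch_ms = {1: 7.8, 4: 22.8}
tensorrt_ms = {1: 13.3, 4: 53.5}

for batch, pt in pytorch_ms.items():
    trt = tensorrt_ms[batch]
    # Ratio of PyTorch latency to TensorRT latency = TensorRT's
    # speed relative to PyTorch for this batch size.
    rel = pt / trt
    print(f"batch {batch}: TensorRT runs at {rel:.0%} of PyTorch speed")
```

This shows the ~44% figure corresponds to the 4-image case; the single-image case is closer to 59%.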
Compare `trtexec` output to the equivalent PyTorch benchmark
Commands or scripts:
Have you tried the latest release?: Yes I have tried TensorRT 10.6 and 10.0
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): Yes, it runs on ONNX Runtime.
If NMS is the last module of your network, I suggest separating it from the network. Running NMS in a user-written CUDA kernel can achieve better performance.
I've been able to integrate the deprecated EfficientNMS plugin into FasterRCNN as an attempted workaround for the slow built-in NMS layers. It is faster than the stock TensorRT NMS layer but still significantly slower than the TorchScript equivalent: in my environment the model runs at 53.5 ms with the built-in layer and 30 ms with EfficientNMS, while PyTorch TorchScript runs at 22 ms.
"If NMS is the last module of your network, I suggest separating it from the network. Running NMS in a user-written CUDA kernel can achieve better performance."
This is a good suggestion, but unfortunately FasterRCNN uses NMS layers in a couple of places, one of which is not at the end of the model (here and here).
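To illustrate the suggestion of moving NMS out of the engine: below is a minimal, pure-Python, single-class NMS reference. In practice you would use `torchvision.ops.nms` or a custom CUDA kernel on the GPU; this sketch only shows the algorithm the post-processing step performs, and the `[x1, y1, x2, y2]` box format and threshold value are assumptions for the example:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    Greedily keeps the highest-scoring box and suppresses any
    remaining box overlapping it by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep
```

For example, with two heavily overlapping boxes and one disjoint box, `nms([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], [0.9, 0.8, 0.7])` keeps the first and third boxes. This only helps when NMS is the last stage, which, as noted above, is not the case for FasterRCNN's internal (RPN) NMS.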
Environment
TensorRT Version: 10.0, 10.6
NVIDIA GPU: GeForce RTX 4090
NVIDIA Driver Version: 550.54.15
CUDA Version: 12.4
CUDNN Version: unsure
Operating System:
Python Version (if applicable): 3.9
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.2
Baremetal or Container (if so, version):
Relevant Files
Model link: https://pytorch.org/vision/main/models/faster_rcnn.html