
Non-Maximal-Suppression (NMS) Layers slow on TensorRT 10.0-10.6 #4248

Open
darrin-willis opened this issue Nov 14, 2024 · 2 comments
Labels
Performance General performance issues triaged Issue has been triaged by maintainers

Comments

@darrin-willis

Description

NMS layers are much slower on TensorRT than on PyTorch (roughly 44% of PyTorch's throughput in my measurements below), and I'm looking for any possible workaround. This seems to be acknowledged as a known issue in the TensorRT release notes:

A performance regression is expected for TensorRT 10.x with respect to TensorRT 8.6 for networks with operations that involve data-dependent shapes, such as non-max suppression or non-zero operations

Is there any possible workaround, or is a fix planned for a specific future version? I am specifically using these layers inside a FasterRCNN network (as implemented in torchvision; see the model link below). I observe this network to be much slower whether running with a single image or with 4 images:

  • Single image inference latency: 7.8ms on PyTorch, 13.3ms on TensorRT
  • 4 image inference latency: 22.8ms on PyTorch, 53.5ms on TensorRT

When I run this network with per-layer profiling, I see that the NonMaxSuppression layers account for 75%+ of the overall inference time. I have verified this on TensorRT 10.0 and 10.6. I have tested using ONNX opset 11 and opset 17.

Environment

TensorRT Version: 10.0, 10.6

NVIDIA GPU: GeForce RTX 4090

NVIDIA Driver Version: 550.54.15

CUDA Version: 12.4

CUDNN Version: unsure

Operating System:

Python Version (if applicable): 3.9

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 2.2

Baremetal or Container (if so, version):

Relevant Files

Model link: https://pytorch.org/vision/main/models/faster_rcnn.html

Steps To Reproduce

  1. Export FasterRCNN to ONNX
  2. Pass ONNX into trtexec
  3. Compare trtexec output to PyTorch equivalent benchmark

Commands or scripts:
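A sketch of what the repro steps above might look like in practice. File names, the input size, and the exact trtexec flags are placeholders, not the reporter's actual commands; `--dumpProfile --separateProfileRun` are trtexec's standard per-layer profiling options.

```shell
# Sketch of the repro, assuming a torchvision FasterRCNN export;
# file names and input size are hypothetical placeholders.
python - <<'PYEOF'
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
dummy = [torch.rand(3, 800, 800)]  # detection models take a list of CHW tensors
torch.onnx.export(model, dummy, "fasterrcnn.onnx", opset_version=17)
PYEOF

# Build the engine and collect per-layer timings to see the NMS layers' cost
trtexec --onnx=fasterrcnn.onnx --saveEngine=fasterrcnn.engine \
        --dumpProfile --separateProfileRun
```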

Have you tried the latest release?: Yes; I have tested TensorRT 10.0 and 10.6.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes, the model runs under ONNX Runtime.

@poweiw poweiw added Performance General performance issues triaged Issue has been triaged by maintainers labels Nov 18, 2024
@lix19937

If NMS is the last module of your network, I suggest separating it from the network; running NMS in a user-provided CUDA kernel can achieve better performance.

@darrin-willis
Author

I've been able to integrate the deprecated EfficientNMSPlugin into FasterRCNN as an attempted workaround for the slow built-in NMS layers. It is faster than the stock TensorRT NMS layer but still significantly slower than the TorchScript equivalent: in my environment the model runs at 53.5ms with the built-in layer and 30ms with EfficientNMS, while PyTorch TorchScript runs at 22ms.

If NMS is the last module of your network, I suggest separating it from the network; running NMS in a user-provided CUDA kernel can achieve better performance.

This is a good suggestion, but unfortunately FasterRCNN uses NMS layers in a couple of places, one of which is not at the end of the model (here and here).
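For the NMS uses that do sit at the end of the model, the suggested split could look like the following: run the network in TensorRT up to the raw box/score outputs, then apply NMS outside the engine. This is a minimal NumPy sketch of greedy IoU suppression (same semantics as torchvision.ops.nms: boxes are [x1, y1, x2, y2], kept indices returned in descending-score order); a real deployment would implement this as a CUDA kernel on the device-side outputs.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes whose IoU with it exceeds the threshold."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the kept box with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # suppress high-overlap boxes
    return np.array(keep)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms(boxes, scores, 0.5))  # keeps boxes 0 and 2; box 1 (IoU ~0.68 with box 0) is suppressed
```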
