NMS layers are much slower on TensorRT than on PyTorch (roughly 44% of PyTorch's speed), and I'm looking for any possible workaround. This seems to be acknowledged as a known issue in the TensorRT release notes here:
"A performance regression is expected for TensorRT 10.x with respect to TensorRT 8.6 for networks with operations that involve data-dependent shapes, such as non-max suppression or non-zero operations."
Is there any possible workaround, or is a fix planned for a specific future version? I am specifically using these layers inside a FasterRCNN network (as implemented in torchvision here). I observe this network to be much slower whether running with a single image or with 4 images:
Single-image inference latency: 7.8 ms on PyTorch, 13.3 ms on TensorRT
4-image inference latency: 22.8 ms on PyTorch, 53.5 ms on TensorRT
When I run this network with per-layer profiling, I see that the NonMaxSuppression layers account for 75%+ of the overall inference time. I have verified this on TensorRT 10.0 and 10.6. I have tested using ONNX opset 11 and opset 17.
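For reference, the relative-performance figure can be recomputed from the latencies quoted above. This short script is illustrative only; the numbers are the ones reported in this issue (lower latency means faster):

```python
# Batch size -> inference latency in milliseconds, as reported above.
pytorch_ms = {1: 7.8, 4: 22.8}
tensorrt_ms = {1: 13.3, 4: 53.5}

for batch, pt in pytorch_ms.items():
    trt = tensorrt_ms[batch]
    # Ratio of PyTorch latency to TensorRT latency = TensorRT's
    # speed relative to PyTorch for this batch size.
    rel = pt / trt
    print(f"batch {batch}: TensorRT runs at {rel:.0%} of PyTorch speed")
```

This shows the ~44% figure corresponds to the 4-image case; the single-image case is closer to 59%.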
Compare `trtexec` output to the equivalent PyTorch benchmark
Commands or scripts:
Have you tried the latest release?: Yes I have tried TensorRT 10.6 and 10.0
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): Yes, it runs on ONNX Runtime.
If NMS is the last module of your network, I suggest separating it from the network. Running NMS in a user-written CUDA kernel can achieve better performance.
I've been able to integrate the deprecated EfficientNMS plugin into FasterRCNN as an attempted workaround for the slow built-in NMS layers. It is faster than the stock TensorRT NMS layer but still significantly slower than the TorchScript equivalent: in my environment the model runs at 53.5 ms with the built-in layer and 30 ms with EfficientNMS, while PyTorch TorchScript runs at 22 ms.
"If NMS is the last module of your network, I suggest separating it from the network. Running NMS in a user-written CUDA kernel can achieve better performance."
This is a good suggestion, but unfortunately FasterRCNN uses NMS layers in a couple of places, one of which is not at the end of the model (here and here).
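To illustrate the suggestion of moving NMS out of the engine: below is a minimal, pure-Python, single-class NMS reference. In practice you would use `torchvision.ops.nms` or a custom CUDA kernel on the GPU; this sketch only shows the algorithm the post-processing step performs, and the `[x1, y1, x2, y2]` box format and threshold value are assumptions for the example:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    Greedily keeps the highest-scoring box and suppresses any
    remaining box overlapping it by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep
```

For example, with two heavily overlapping boxes and one disjoint box, `nms([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], [0.9, 0.8, 0.7])` keeps the first and third boxes. This only helps when NMS is the last stage, which, as noted above, is not the case for FasterRCNN's internal (RPN) NMS.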
Environment
TensorRT Version: 10.0, 10.6
NVIDIA GPU: GeForce RTX 4090
NVIDIA Driver Version: 550.54.15
CUDA Version: 12.4
CUDNN Version: unsure
Operating System:
Python Version (if applicable): 3.9
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.2
Baremetal or Container (if so, version):
Relevant Files
Model link: https://pytorch.org/vision/main/models/faster_rcnn.html