v14.test4: latest TensorRT and ONNX Runtime libraries
This is a preview release for TensorRT 10.0.0, following the v14.test, v14.test2 and v14.test3 releases.
- The `TRT` backend no longer supports Maxwell and Pascal GPUs. Other backends still support these GPUs. As with those releases, the current release requires driver version >= 525.
- Added support for SwinIR models for image restoration, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later. SwinIR-M and SwinIR-L models exhibit precision issues with the fp16 implementation; this is under investigation.
- Added support for SCUNet models for image denoising, which are only supported by the `TRT` backend and the `ORT_CPU` backend from vs-mlrt v14.test4 or later.
- Added the `engine_folder` argument to the `TRT` backend in vsmlrt.py to specify a custom directory for engines.
- Starting with this pre-release, for dynamically shaped engines, the TRT runtime allocates GPU memory based on the actual tile size, whereas in previous releases the runtime had to allocate GPU memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.
- The `ORT_*` backends now support fp16 I/O. The semantics of the `fp16` flag are as follows:
  - Enabling `fp16` uses a built-in quantization that converts an fp32 onnx to an fp16 onnx. If the input video is in half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the `output_format` option (`0` = fp32, `1` = fp16).
  - Disabling `fp16` does not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
- Reduced the overhead of the `ORT_CUDA` backend.
- Added support for TF32 acceleration to the `ORT_CUDA` backend. Disabled by default.
- Added an experimental `prefer_nhwc` flag to the `ORT_CUDA` backend to reduce the number of layout transformations when using tensor cores.
- For production use of the `TRT` backend, continue to use vsmlrt v13.2. For RIFE and SAFA acceleration on the `TRT` backend, continue to use any older release.
- Also check the release notes of the previous pre-releases.
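The `fp16` flag semantics of the `ORT_*` backends described in the list above can be summarized as a small decision function. This is only an illustrative model, not code from vsmlrt.py: the names `Fp16Plan` and `resolve_fp16` are hypothetical, and two behaviors are assumptions rather than statements from these notes: that a non-fp16 clip keeps fp32 input when `fp16` is enabled, and that a clip/onnx mismatch with `fp16` disabled is treated as an error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fp16Plan:
    converts_model: bool  # True if the built-in fp32 -> fp16 conversion runs
    input_dtype: str      # "fp16" or "fp32"
    output_dtype: str     # "fp16" or "fp32"


def resolve_fp16(fp16, video_dtype, output_format=0,
                 onnx_input_dtype="fp32", onnx_output_dtype="fp32"):
    """Model the fp16 flag semantics from the notes (hypothetical helper)."""
    if fp16:
        # Built-in conversion: fp16 input only when the clip is already fp16.
        # Assumption: any other clip format keeps fp32 input.
        input_dtype = "fp16" if video_dtype == "fp16" else "fp32"
        # output_format: 0 = fp32, 1 = fp16 (as stated in the notes).
        output_dtype = {0: "fp32", 1: "fp16"}[output_format]
        return Fp16Plan(True, input_dtype, output_dtype)

    # No conversion: the onnx file dictates input and output formats.
    if video_dtype != onnx_input_dtype:
        # Assumption: a mismatch is reported as an error in this sketch.
        raise ValueError(
            f"clip is {video_dtype} but the onnx expects {onnx_input_dtype}"
        )
    return Fp16Plan(False, onnx_input_dtype, onnx_output_dtype)


if __name__ == "__main__":
    print(resolve_fp16(True, "fp16", output_format=1))
    print(resolve_fp16(False, "fp16", onnx_input_dtype="fp16",
                       onnx_output_dtype="fp16"))
```

The second call models an onnx file that already computes in fp16: no conversion runs, and both ends follow the model.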
benchmark 1
- RTX 4090
- processor clock @ 2520 MHz
- Intel Icelake server @ 2100 MHz
- Driver 551.86
- Windows 10 21H2 (19044.1415)
- TensorRT 10.0.0
- VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3
1920x1080 rgbs, CUDA graphs enabled, fp16
Measurements: FPS / Device Memory (MB)
general
model | 1 stream | 2 streams | 3 streams |
---|---|---|---|
dpir gray | 22.05 / 1818.796 | 25.30 / 3111.114 | 25.33 / 4403.488 |
dpir color | 18.30 / 1851.632 | 25.13 / 3176.808 | 25.17 / 4501.984 |
waifu2x upconv_7_{anime_style_art_rgb, photo} | 20.45 / 2148.716 | 41.22 / 3867.240 | 61.21 / 5585.764 |
waifu2x upresnet10 | 17.91 / 1716.588 | 34.53 / 2941.540 | 42.33 / 4166.492 |
waifu2x cunet / cugan | 13.89 / 4391.292 | 25.74 / 8346.248 | 25.96 / 12301.202 |
waifu2x swin_unet | 4.62 / 7436.692 | 5.43 / 14426.812 | 5.43 / 21412.840 |
real-esrgan (v2/v3, xsx2) | 17.06 / 1087.844 | 33.41 / 1778.264 | 38.26 / 2468.684 |
scunet gray | 5.29 / 3590.320 | 5.40 / 6678.768 | 5.40 / 9767.208 |
scunet color | 5.13 / 3555.568 | 5.48 / 6611.308 | 5.47 / 9667.048 |
swinir-s (2x, color) | 1.63 / 15897.048 | N/A | N/A |
swinir-m* (2x, color, 720p) | 1.05 / 11305.268 | N/A | N/A |
swinir-l* (4x, color, 720p) | 0.61 / 15391.316 | N/A | N/A |
*: swinir-m and swinir-l exhibit precision issues.
rife
v2, fp16 i/o
version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
---|---|---|---|---|---|
v4.4-v4.5 | 136.92/778.432 | 273.80/1149.204 | 414.80/1522.028 | 553.70/1892.796 | 574.31/2263.568 |
v4.6 | 136.01/800.960 | 275.26/1192.212 | 411.01/1585.516 | 544.30/1979.764 | 550.01/2368.020 |
v4.7-v4.9 | 98.20/1302.724 | 195.78/2187.548 | 210.12/3074.420 | 210.45/3957.196 | 210.66/4844.068 |
v4.10-v4.15 | 84.41/1595.592 | 160.93/2773.280 | 161.96/3953.020 | 162.04/5132.760 | 162.07/6310.448 |
{v4.12, v4.13, v4.15, v4.16}_lite | 93.39/1333.444 | 187.32/2255.132 | 197.71/3178.872 | 198.01/4098.508 | 197.95/5022.248 |
v4.14 lite | 81.83/1595.292 | 153.40/2779.424 | 154.19/3963.260 | 154.28/5149.140 | 154.30/6332.980 |
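To read the multi-stream columns, it can help to turn a row into a throughput ratio and a memory increment per added stream. The sketch below does this for the RIFE v4.4-v4.5 row of the table above; `parse_cell` is a hypothetical helper written for this example, not part of vsmlrt.

```python
def parse_cell(cell):
    """Split an 'FPS / memory' cell into floats; return None for N/A or OOM."""
    text = cell.strip()
    if text in ("N/A", "OOM"):
        return None
    fps, mem = (float(part) for part in text.split("/"))
    return fps, mem


# RIFE v4.4-v4.5 row (1 to 5 streams), copied from the table above.
row = [
    "136.92/778.432",
    "273.80/1149.204",
    "414.80/1522.028",
    "553.70/1892.796",
    "574.31/2263.568",
]

cells = [parse_cell(c) for c in row]
base_fps = cells[0][0]

# Throughput relative to a single stream.
speedups = [fps / base_fps for fps, _ in cells]
# Extra device memory (MB) consumed by each additional stream.
mem_steps = [b[1] - a[1] for a, b in zip(cells, cells[1:])]

for n, speedup in enumerate(speedups, start=1):
    print(f"{n} stream(s): {speedup:.2f}x")
print("MB per added stream:", [round(step, 1) for step in mem_steps])
```

For this row, each added stream costs roughly 371 MB, while throughput scales almost linearly up to 4 streams and gains little from the fifth.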
benchmark 2
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | ORT_CUDA NCHW | ORT_CUDA NHWC | ORT_DML |
---|---|---|---|
dpir color | 4.54 / 2573.3 | 5.98 / 2470.9 | 8.45 / 2364.5 |
dpir color (2 streams) | 4.66 / 4854.9 | 6.30 / 4680.8 | 9.48 / 4630.9 |
waifu2x upconv7 | 10.98 / 5432.5 | 3.18 / 3017.8 | 12.48 / 4493.0 |
waifu2x upconv7 (2 streams) | 14.96 / 10397.1 | 3.25 / 5780.9 | 21.72 / 8891.7 |
waifu2x cunet / cugan | 4.70 / 7955.6 | 4.49 / 6290.6 | OOM |
waifu2x cunet / cugan (2 streams) | 5.11 / 15721.9 | 4.78 / 12312.0 | OOM |
waifu2x swin_unet_art | 2.98 / 23518.5 | 3.05 / 22812.0 | N/A |
realesrgan | 8.99 / 1647.7 | 11.20 / 1127.5 | 11.99 / 1346.6 |
realesrgan (2 streams) | 10.69 / 3034.5 | 13.58 / 1994.1 | 17.34 / 2601.6 |
rife v4.4 (1920x1088) | 61.42 / 1100.9 | 56.02 / 1162.3 | 44.73 / 882.4 |
rife v4.4 (1920x1088, 2 streams) | 106.48 / 1953.4 | 92.88 / 2071.9 | 68.80 / 1670.7 |
scunet color | N/A | N/A | N/A |
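Since `prefer_nhwc` is experimental, the NCHW and NHWC columns are worth comparing per model before enabling it. A minimal sketch using the single-stream FPS values from the table above (the `nhwc_gain` helper is hypothetical):

```python
# (NCHW fps, NHWC fps) for single-stream rows, copied from the table above.
fps = {
    "dpir color": (4.54, 5.98),
    "waifu2x upconv7": (10.98, 3.18),
    "waifu2x cunet / cugan": (4.70, 4.49),
    "realesrgan": (8.99, 11.20),
    "rife v4.4 (1920x1088)": (61.42, 56.02),
}


def nhwc_gain(nchw, nhwc):
    """Ratio above 1.0 means the NHWC layout was faster in this benchmark."""
    return nhwc / nchw


gains = {model: nhwc_gain(*pair) for model, pair in fps.items()}
faster = sorted(model for model, gain in gains.items() if gain > 1.0)

for model, gain in gains.items():
    print(f"{model}: {gain:.2f}x")
print("NHWC faster for:", faster)
```

On this RTX 3090 run, NHWC helps dpir and realesrgan but is markedly slower for waifu2x upconv7, so the flag is best evaluated model by model.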
benchmark 3
NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | TRT | ORT_CUDA | ORT_DML | ORT_CUDA NHWC |
---|---|---|---|---|
dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |
dpir color (2 streams) | 8.06 / 3376 | 3.30 / 5016 | 5.85 / 4619 | 4.74 / 4650 |
waifu2x upconv7 (1 stream) | 11.47 / 2014 | 7.01 / 4949 | 7.45 / 4501 | 1.59 / 2923 |
waifu2x upconv7 (2 streams) | 21.44 / 3782 | 10.11 / 9732 | 13.23 / 8940 | 1.77 / 5674 |
waifu2x cunet / cugan (1 stream) | 7.41 / 4664 | 3.10 / 10067 | OOM | 0.77 / 6188 |
waifu2x cunet / cugan (2 streams) | 10.92 / 8863 | OOM | OOM | OOM |
waifu2x swin_unet_art (1 stream) | 2.35 / 7234 | OOM | N/A | OOM |
waifu2x swin_unet_art (2 streams) | OOM | OOM | N/A | OOM |
realesrgan (1 stream) | 8.66 / 1268 | 5.33 / 1545 | 6.39 / 1316 | 6.96 / 1033 |
realesrgan (2 streams) | 13.20 / 2166 | 7.78 / 2932 | 10.22 / 2571 | 10.25 / 1895 |
rife v4.4 (1920x1088, fp16 i/o, 1 stream) | 64.97 / 609 | 46.60 / 967 | 32.18 / 723 | 48.55 / 1014 |
rife v4.4 (1920x1088, fp16 i/o, 2 streams) | 127.38 / 1027 | 69.77 / 1868 | 51.01 / 1385 | 76.10 / 1054 |
scunet color (1 stream) | 2.73 / 3829 | N/A | N/A | N/A |
scunet color (2 streams) | 2.85 / 7165 | N/A | N/A | N/A |
Version information:
- This pre-release uses trt 10.0.0 + cuda 12.4.0 + cudnn 8.9.7 + ort 1.18, which requires a minimum driver version of 525 and is compatible with 16/20 series and newer GPUs. Engine compilation time is reduced by up to 40%, but the runtime performance of RIFE models is up to 30% worse, with nearly doubled GPU memory usage.
`vsmlrt.py` in all branches can be used interchangeably.
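The `TRT` backend requirements stated in these notes (driver >= 525, no Maxwell or Pascal) can be expressed as a quick pre-flight check. This is a sketch under assumptions: `trt_backend_supported` is a hypothetical helper, and it maps "Maxwell and Pascal" to CUDA compute capability 5.x and 6.x, rejecting anything below 7.0. Architectures not mentioned in the notes are not specifically addressed.

```python
def trt_backend_supported(driver_version: str,
                          compute_capability: tuple[int, int]) -> bool:
    """Return True if the stated TRT backend requirements appear to be met."""
    # Driver strings look like "551.86"; only the major part is compared.
    driver_major = int(driver_version.split(".")[0])
    cc_major, _cc_minor = compute_capability
    # Assumption: Maxwell = 5.x, Pascal = 6.x, so require major >= 7.
    return driver_major >= 525 and cc_major >= 7


if __name__ == "__main__":
    # RTX 4090 (compute capability 8.9) with the driver from benchmark 1.
    print(trt_backend_supported("551.86", (8, 9)))
    # A Pascal card (for example compute capability 6.1) is rejected.
    print(trt_backend_supported("551.86", (6, 1)))
```

The compute capability of the installed GPU would have to be obtained separately; this sketch only encodes the thresholds.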