
v14.test4: latest TensorRT and ONNX Runtime libraries

Pre-release
@github-actions released this 27 Mar 03:27 · 190 commits to master since this release

This is a preview release for TensorRT 10.0.0, following the v14.test, v14.test2 and v14.test3 releases.

  • The TRT backend no longer supports Maxwell and Pascal GPUs. Other backends still support these GPUs. As with those releases, this release requires driver version >= 525.

  • Added support for SwinIR models for image restoration, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later. The SwinIR-M and SwinIR-L models exhibit precision issues with the fp16 implementation; this is under investigation.

  • Added support for SCUNet models for image denoising, which are only supported by the TRT backend and the ORT_CPU backend from vs-mlrt v14.test4 or later.

  • Added an engine_folder argument to the TRT backend in vsmlrt.py to specify a custom directory for compiled engines.
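
    A minimal sketch of how the new argument could be used; the engine_folder name comes from this note, while the surrounding filter and the other arguments are only illustrative:

    ```python
    import vapoursynth as vs
    from vsmlrt import DPIR, DPIRModel, Backend

    core = vs.core

    # placeholder clip; any RGBS clip works here
    clip = core.std.BlankClip(format=vs.RGBS, width=1920, height=1080)

    # engine_folder (new in this pre-release) redirects the compiled engines
    # to a custom directory instead of the default location next to the onnx model
    backend = Backend.TRT(fp16=True, engine_folder="D:/trt_engines")

    out = DPIR(clip, strength=5.0, model=DPIRModel.drunet_color, backend=backend)
    out.set_output()
    ```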

  • Starting with this pre-release, for dynamically shaped engines the TRT runtime allocates GPU memory based on the actual tile size; in previous releases, the runtime had to allocate GPU memory based on the maximum tile size set at engine compile time. This feature requires TensorRT 10 or later.

  • The ORT_* backends now support fp16 I/O. The semantics of the fp16 flag are as follows:

    • Enabling fp16 will use a built-in quantization that converts an fp32 onnx to an fp16 onnx. If the input video is in a half-precision floating-point format, the generated fp16 onnx will use fp16 input. The output format can be controlled by the output_format option (0 = fp32, 1 = fp16).
    • Disabling fp16 will not use the built-in quantization. However, if the onnx file itself uses fp16 for computation, the actual computation will be done in fp16. In this case, the input video format should match the input format of the onnx, and the output format is inferred from the onnx.
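
    A short sketch of the enabled-fp16 path described above, using ORT_CUDA as the example backend; whether output_format is exposed as a backend field in vsmlrt.py (rather than only at the plugin level) is an assumption based on this note:

    ```python
    import vapoursynth as vs
    from vsmlrt import Waifu2x, Waifu2xModel, Backend

    core = vs.core

    # half-precision (RGBH) input, so the quantized fp16 onnx will take fp16 input
    clip = core.std.BlankClip(format=vs.RGBH, width=1920, height=1080)

    # fp16=True applies the built-in fp32 -> fp16 quantization;
    # output_format=1 requests fp16 output (0 would keep fp32 output)
    backend = Backend.ORT_CUDA(fp16=True, output_format=1)

    out = Waifu2x(clip, noise=-1, scale=2,
                  model=Waifu2xModel.upconv_7_anime_style_art_rgb, backend=backend)
    ```
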
  • Reduced the overhead of the ORT_CUDA backend.

  • Added support for TF32 acceleration to the ORT_CUDA backend. Disabled by default.

  • Added an experimental prefer_nhwc flag to the ORT_CUDA backend to reduce the number of layout transformations when using tensor cores, as sketched below.
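
    A sketch combining the two new ORT_CUDA options above; the tf32 and prefer_nhwc flag names follow this note, and their exact spelling in vsmlrt.py is an assumption:

    ```python
    from vsmlrt import Backend

    # tf32: TF32 acceleration (disabled by default per this release note)
    # prefer_nhwc: experimental NHWC layout to reduce layout transformations
    #              when tensor cores are used
    backend = Backend.ORT_CUDA(fp16=True, tf32=False, prefer_nhwc=True)
    ```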

  • For production use of the TRT backend, continue to use vsmlrt v13.2. For RIFE and SAFA acceleration on the TRT backend, continue to use any older release.

  • Also check the release notes of the previous pre-releases.


benchmark 1

previous benchmark

  • RTX 4090
    • processor clock @ 2520 MHz
  • Intel Icelake server @ 2100 MHz
  • Driver 551.86
  • Windows 10 21H2 (19044.1415)
  • TensorRT 10.0.0
  • VapourSynth-Classic R57.A8, vapoursynth-plugin v0.96g3

1920x1080 rgbs, CUDA graphs enabled, fp16

Measurements: FPS / Device Memory (MB)

general

| model | 1 stream | 2 streams | 3 streams |
|---|---|---|---|
| dpir gray | 22.05 / 1818.796 | 25.30 / 3111.114 | 25.33 / 4403.488 |
| dpir color | 18.30 / 1851.632 | 25.13 / 3176.808 | 25.17 / 4501.984 |
| waifu2x upconv_7_{anime_style_art_rgb, photo} | 20.45 / 2148.716 | 41.22 / 3867.240 | 61.21 / 5585.764 |
| waifu2x upresnet10 | 17.91 / 1716.588 | 34.53 / 2941.540 | 42.33 / 4166.492 |
| waifu2x cunet / cugan | 13.89 / 4391.292 | 25.74 / 8346.248 | 25.96 / 12301.202 |
| waifu2x swin_unet | 4.62 / 7436.692 | 5.43 / 14426.812 | 5.43 / 21412.840 |
| real-esrgan (v2/v3, xsx2) | 17.06 / 1087.844 | 33.41 / 1778.264 | 38.26 / 2468.684 |
| scunet gray | 5.29 / 3590.320 | 5.40 / 6678.768 | 5.40 / 9767.208 |
| scunet color | 5.13 / 3555.568 | 5.48 / 6611.308 | 5.47 / 9667.048 |
| swinir-s (2x, color) | 1.63 / 15897.048 | N/A | N/A |
| swinir-m* (2x, color, 720p) | 1.05 / 11305.268 | N/A | N/A |
| swinir-l* (4x, color, 720p) | 0.61 / 15391.316 | N/A | N/A |

*: swinir-m and swinir-l exhibit precision issues.

rife

v2, fp16 i/o

| version | 1 stream | 2 streams | 3 streams | 4 streams | 5 streams |
|---|---|---|---|---|---|
| v4.4-v4.5 | 136.92 / 778.432 | 273.80 / 1149.204 | 414.80 / 1522.028 | 553.70 / 1892.796 | 574.31 / 2263.568 |
| v4.6 | 136.01 / 800.960 | 275.26 / 1192.212 | 411.01 / 1585.516 | 544.30 / 1979.764 | 550.01 / 2368.020 |
| v4.7-v4.9 | 98.20 / 1302.724 | 195.78 / 2187.548 | 210.12 / 3074.420 | 210.45 / 3957.196 | 210.66 / 4844.068 |
| v4.10-v4.15 | 84.41 / 1595.592 | 160.93 / 2773.280 | 161.96 / 3953.020 | 162.04 / 5132.760 | 162.07 / 6310.448 |
| {v4.12, v4.13, v4.15, v4.16}_lite | 93.39 / 1333.444 | 187.32 / 2255.132 | 197.71 / 3178.872 | 198.01 / 4098.508 | 197.95 / 5022.248 |
| v4.14 lite | 81.83 / 1595.292 | 153.40 / 2779.424 | 154.19 / 3963.260 | 154.28 / 5149.140 | 154.30 / 6332.980 |

benchmark 2

previous benchmark

NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 552.22, Windows Server 2022, Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | ORT_CUDA NCHW | ORT_CUDA NHWC | ORT_DML |
|---|---|---|---|
| dpir color | 4.54 / 2573.3 | 5.98 / 2470.9 | 8.45 / 2364.5 |
| dpir color (2 streams) | 4.66 / 4854.9 | 6.30 / 4680.8 | 9.48 / 4630.9 |
| waifu2x upconv7 | 10.98 / 5432.5 | 3.18 / 3017.8 | 12.48 / 4493.0 |
| waifu2x upconv7 (2 streams) | 14.96 / 10397.1 | 3.25 / 5780.9 | 21.72 / 8891.7 |
| waifu2x cunet / cugan | 4.70 / 7955.6 | 4.49 / 6290.6 | OOM |
| waifu2x cunet / cugan (2 streams) | 5.11 / 15721.9 | 4.78 / 12312.0 | OOM |
| waifu2x swin_unet_art | 2.98 / 23518.5 | 3.05 / 22812.0 | N/A |
| realesrgan | 8.99 / 1647.7 | 11.20 / 1127.5 | 11.99 / 1346.6 |
| realesrgan (2 streams) | 10.69 / 3034.5 | 13.58 / 1994.1 | 17.34 / 2601.6 |
| rife v4.4 (1920x1088) | 61.42 / 1100.9 | 56.02 / 1162.3 | 44.73 / 882.4 |
| rife v4.4 (1920x1088, 2 streams) | 106.48 / 1953.4 | 92.88 / 2071.9 | 68.80 / 1670.7 |
| scunet color | N/A | N/A | N/A |

benchmark 3

NVIDIA GeForce RTX 2080 Ti, 4352 shaders @ 1700 MHz, driver 552.22, Windows 10 LTSC 21H2 (19044.1415), Python 3.11.9, vapoursynth-classic R57.A8

Measurements: (1080p, fp16) FPS / Device Memory (MB)

| model | TRT | ORT_CUDA | ORT_DML | ORT_CUDA NHWC |
|---|---|---|---|---|
| dpir color (1 stream) | 7.08 / 1899 | 3.10 / 2602 | 4.99 / 2341 | 4.26 / 2411 |
| dpir color (2 streams) | 8.06 / 3376 | 3.30 / 5016 | 5.85 / 4619 | 4.74 / 4650 |
| waifu2x upconv7 (1 stream) | 11.47 / 2014 | 7.01 / 4949 | 7.45 / 4501 | 1.59 / 2923 |
| waifu2x upconv7 (2 streams) | 21.44 / 3782 | 10.11 / 9732 | 13.23 / 8940 | 1.77 / 5674 |
| waifu2x cunet / cugan (1 stream) | 7.41 / 4664 | 3.10 / 10067 | OOM | 0.77 / 6188 |
| waifu2x cunet / cugan (2 streams) | 10.92 / 8863 | OOM | OOM | OOM |
| waifu2x swin_unet_art (1 stream) | 2.35 / 7234 | OOM | N/A | OOM |
| waifu2x swin_unet_art (2 streams) | OOM | OOM | N/A | OOM |
| realesrgan (1 stream) | 8.66 / 1268 | 5.33 / 1545 | 6.39 / 1316 | 6.96 / 1033 |
| realesrgan (2 streams) | 13.20 / 2166 | 7.78 / 2932 | 10.22 / 2571 | 10.25 / 1895 |
| rife v4.4 (1920x1088, fp16 i/o, 1 stream) | 64.97 / 609 | 46.60 / 967 | 32.18 / 723 | 48.55 / 1014 |
| rife v4.4 (1920x1088, fp16 i/o, 2 streams) | 127.38 / 1027 | 69.77 / 1868 | 51.01 / 1385 | 76.10 / 1054 |
| scunet color (1 stream) | 2.73 / 3829 | N/A | N/A | N/A |
| scunet color (2 streams) | 2.85 / 7165 | N/A | N/A | N/A |

Version information:

  • This pre-release uses trt 10.0.0 + cuda 12.4.0 + cudnn 8.9.7 + ort 1.18, which requires a minimum driver version of 525 and is compatible with 16/20 series and newer GPUs. The engine compilation time is reduced by up to 40%, but the runtime performance of RIFE models is worsened by up to 30%, with nearly doubled GPU memory usage.
  • vsmlrt.py in all branches can be used interchangeably.