Releases: AmusementClub/vs-mlrt
v14.test2: latest TensorRT library
This is a preview release for TensorRT 9.1.0, following the v14.test release.
- Same as the v14.test release, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- TensorRT 9.1.0 is officially documented as "for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only" on Linux. The Windows build is downloaded from here, and can be used on other GPU models.
  - On Windows, some users have reported crashes when using it in mpv (#65). This problem occurred in an earlier build of this release and has since been fixed.
- Add parameters `bf16` (#64), `custom_env` and `custom_args` to the `TRT` backend.
  - fp16 execution of `Waifu2xModel.swin_unet_art` is more accurate, faster and uses less GPU memory than bf16 execution (benchmark).
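  - A minimal sketch of constructing the backend with the new parameters; only the parameter names come from this release, while treating `custom_env` as extra environment variables and `custom_args` as extra engine-building arguments is an assumption for illustration:

    ```python
    import vsmlrt
    from vsmlrt import Backend

    # bf16 toggles bfloat16 execution; per the note above, fp16 is usually
    # preferable to bf16 for Waifu2xModel.swin_unet_art.
    backend = Backend.TRT(
        fp16=True,
        bf16=False,
        custom_env={"CUDA_MODULE_LOADING": "LAZY"},  # assumed: extra environment variables
        custom_args=[],                              # assumed: extra engine-building arguments
    )
    ```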
- Device memory usage of model `Waifu2xModel.swin_unet_art` is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (at 2.66 fps with 7.0GB VRAM usage) with the default auxiliary stream heuristic.
  - TensorRT 9.0.1 uses 7 auxiliary streams compared to 3 streams for TensorRT 9.1.0, which results in significantly higher device memory usage with no performance gain.
  - Setting `max_aux_streams=3` lowers the device memory usage of TensorRT 9.0.1 to ~8.9GB, and `max_aux_streams=0` corresponds to ~7.3GB usage.
  - TensorRT 9.1.0 with `max_aux_streams=0` uses ~6.7GB device memory.
- Users should use the same version of TensorRT as provided (9.1.0) because runtime version checking is disabled in this release.
- Added support for RIFE v4.8 - v4.12 and v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`). v4.8 and v4.9 models should have the same execution speed as v4.7, while v4.10 - v4.12 models are heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.
  - Starting from RIFE v4.11, all RIFE models are temporarily moved here with individual packaging.
- RIFE models with the v2 representation for the `TRT` backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.
  - This improvement may be very slightly inefficient under onnx file renaming. It is advised to keep the onnx file name unchanged and change the function call to `vsmlrt.RIFE()` instead.
  - By default, `vsmlrt.RIFE()` in `vsmlrt.py` uses the v1 representation. The v2 representation is enabled with the `vsmlrt.RIFE(_implementation=2)` function call, as in the sketch below.
  - Sample error message:

    ```
    input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
    ```
  - The v2 representation is still considered experimental.
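  - A minimal sketch of selecting the v2 representation through the wrapper; `_implementation` is the parameter named above, and `padded` is assumed to be a mod-32 padded RGBS/RGBH clip as in the v13 RIFE snippet further down:

    ```python
    import vsmlrt
    from vsmlrt import Backend

    # Keep the shipped onnx file names unchanged and pick the v2
    # representation via the function call instead.
    flt = vsmlrt.RIFE(
        padded,
        model=vsmlrt.RIFEModel.v4_6,
        backend=Backend.TRT(fp16=True),
        _implementation=2,  # v2 representation (still experimental)
    )
    ```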
- Added support for the SAFA v0.1 video enhancement model.
  - This model takes arbitrarily sized video and uses both spatial and temporal information to improve visual quality.
  - Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
  - ~17 fps on RTX 4090 with `TRT(fp16)`, 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory and does not support CUDA graphs execution.
  - This model is not supported by the `NCNN_VK` backend due to the same issue as the RIFE v2 representation.
- Also check the release notes of the v14.test release.
  - This pre-release uses trt 9.1.0 + cuda 12.2.2 + cudnn 8.9.5, which can only run on driver >= 525 and 10 series and later GPUs, with improved support for the self-attentions found in transformer models. `vsmlrt.py` in all branches can be used interchangeably.
  - TensorRT 9.0.1 is "for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only" on x86 Linux. Model `Waifu2xModel.swin_unet_art` is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (at 6.3 fps with 4GB VRAM usage), thanks to multi-head attention fusion (requires fp16).
This pre-release is now feature complete. Development has now switched to the v14.test3 pre-release.
v14.test: latest TensorRT library
This is a preview release for TensorRT 8.6.1.
- It requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- Add parameters `builder_optimization_level` and `max_aux_streams` to the `TRT` backend (see the sketch after this list).
  - `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" (link)
  - `max_aux_streams`: within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." (link)
  - It is advised to lower `max_aux_streams` to 0 on heavy models like `Waifu2xModel.swin_unet_art` to reduce memory usage. Check the benchmark data at the bottom.
- Following TensorRT 8.6.1, the `cudnn` tactic source of the `TRT` backend is disabled by default. `tf32` is also disabled by default in vsmlrt.py.
- Add parameter `short_path` to the `TRT` backend, which shortens the engine path and is enabled on Windows by default.
- Model `Waifu2xModel.swin_unet_art` does not seem to work with `builder_optimization_level=5` in the `TRT` backend before TRT 9.0. Use `builder_optimization_level=4` or lower instead.
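A minimal sketch combining the two new parameters with the guidance above; `rgbs` is assumed to be an RGBS input clip and the values are examples only:

```python
import vsmlrt
from vsmlrt import Backend

# Heavy transformer model: keep the optimization level below 5 (see the note
# above) and disable auxiliary streams to reduce device memory usage.
backend = Backend.TRT(
    fp16=True,
    builder_optimization_level=4,
    max_aux_streams=0,
)
flt = vsmlrt.Waifu2x(rgbs, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)
```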
Less than 5% performance improvement is observed among built-in models compared to v13.1/v13.2, with a 24% device memory usage reduction on DPIR and 35% on RealESRGAN.
Version information:
- The v13.2 release uses trt 8.5.1 + cuda 11.8.0, which can run on driver >= 450 and 900 series and later GPUs.
- The v14.test pre-release uses trt 8.6.1 + cuda 12.1.1, which can only run on driver >= 525 and 10 series and later GPUs, with no significant performance improvement measured.
- `vsmlrt.py` in both branches can be used interchangeably.
- Added support for the RIFE v4.7 model ("optimized for anime scenes"), which is also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`). It is more computationally intensive than v4.6.
- This pre-release is now feature complete. Development has now switched to the trt-latest branch and the v14.test2 pre-release.
v13.2: latest ort library, DirectML backend
- Added support for the DirectML backend through ONNX Runtime. It is available for all DX12 devices and can be accessed through `backend=Backend.ORT_DML()` (see the sketch below). waifu2x swin_unet models may not be supported on this backend, and RIFE models may be poorly supported.
- Asset `vsmlrt-windows-x64-vk.*.7z` is renamed to `vsmlrt-windows-x64-generic-gpu.*.7z` and includes backends `OV_CPU`, `OV_GPU`, `ORT_CPU`, `ORT_DML` and `NCNN_VK`. The `cuda` asset continues to include all backends in this release.
- Update onnxruntime to microsoft/onnxruntime@73584f9.
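For example, a minimal sketch of running DPIR through the new backend; the DPIR call follows the wrapper's usual interface, `rgbs` is an assumed RGBS input clip and the `strength` value is arbitrary:

```python
import vsmlrt
from vsmlrt import Backend

# DirectML runs on any DX12 device; note the caveats above about
# waifu2x swin_unet and RIFE models on this backend.
flt = vsmlrt.DPIR(
    rgbs,
    strength=5,
    model=vsmlrt.DPIRModel.drunet_color,
    backend=Backend.ORT_DML(),
)
```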
Note
- Backend `OV_GPU` may produce reduced precision output. This is under investigation.
benchmark 1
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 532.03, Windows 10 21H2 LTSC (19044.1415), Python 3.11.3, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | ORT_CUDA | ORT_DML |
---|---|---|
dpir | 4.25 / 2573.3 | 7.01 / 2371.0 |
dpir (2 streams) | 4.58 / 5506.2 | 8.85 / 4643.1 |
waifu2x upconv7 | 9.10 / 5248.1 | 9.65 / 4503.1 |
waifu2x upconv7 (2 streams) | 11.15 / 2966.9 | 18.52 / 8911.2 |
waifu2x cunet / cugan | 4.06 / 7875.7 | 6.36 / 8973.7 |
waifu2x cunet / cugan (2 streams) | N/A | 9.51 / 17849.1 |
waifu2x swin_unet_art | N/A | N/A |
realesrgan | 7.52 / 1901.7 | 8.54 / 1352.4 |
realesrgan (2 streams) | 11.15 / 2966.9 | 15.58 / 2608.7 |
rife | 34.30 / 1109.1 | 2.12 / 1417.8 |
rife (2 streams) | 61.45 / 2051.4 | 4.27 / 2740.9 |
benchmark 2
AMD Radeon Pro V620 MxGPU, 4608 shaders @ 2390 MHz, Adrenalin 21.20.02.13, Windows Server 2019, Python 3.11.3, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | NCNN_VK | ORT_DML |
---|---|---|
dpir | 1.70 / 3248.4 | 4.75 / 2308.1 |
dpir (2 streams) | 1.74 / 6099.5 | 4.86 / 4584.6 |
waifu2x upconv7 | 5.18 / 6872.3 | 14.51 / 4448.5 |
waifu2x upconv7 (2 streams) | 6.14 / 13701 | 15.98 / 8861.2 |
waifu2x cunet / cugan (2x2 tiles) | 1.07 / 3159.8 | 5.57 / 2196.7 |
waifu2x cunet / cugan (2x2 tiles, 2 streams) | 1.07 / 3159.8 | 6.08 / 4357.8 |
waifu2x swin_unet_art | N/A | N/A |
realesrgan | 3.86 / 2699.7 | 9.59 / 1290.4 |
realesrgan (2 streams) | 4.43 / 5355.8 | 10.58 / 2545.3 |
rife | N/A | 2.68 / 1353.5 |
rife (2 streams) | N/A | 4.44 / 2673.3 |
v13.1: latest ov & ort library, new models
- Update openvino to openvinotoolkit/openvino@b0ffec4, with improved support for RIFE in both `OV_CPU` and `OV_GPU`. (benchmark on arc a380)
- Fix a typo in vsmlrt.py's `__all__`.
- Added parameter `num_threads` to the `OV_CPU` backend (see the sketch after this list).
- Update onnxruntime to microsoft/onnxruntime@8ed3dfe.
- The default value of `workspace` of the TRT backend will be changed to `None` in the next release, and `use_cudnn` will be changed to `False`.
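A minimal sketch of the new parameter; the thread count is only an example, and the CUGAN call is an assumed illustration of passing the backend to a wrapper function (`rgbs` is an assumed RGBS clip):

```python
import vsmlrt
from vsmlrt import Backend

# Limit OpenVINO CPU inference to 8 threads via the new num_threads parameter.
backend = Backend.OV_CPU(num_threads=8)
flt = vsmlrt.CUGAN(rgbs, noise=-1, scale=2, backend=backend)
```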
The following models can be found in External Models:
- Add support for waifu2x `swin_unet` models.
- Add support for the `ensemble` configuration of RIFE. (RIFE v2 acceleration is experimental and may result in reduced quality on the `TRT` backend with fp16 enabled.)
Note
- Backend `OV_GPU` may produce reduced precision output. This is under investigation.
Contributed Models
Please see PR #42 for policy.
- AnimeJaNai_v2.7z: `RealESRGANv2/{animejanaiV2L1, animejanaiV2L2, animejanaiV2L3}.onnx`, contributed by @hooke007 in #53.
- AnimeJaNai_v3.7z: `RealESRGANv2/{animejanaiV3-HD-L1, animejanaiV3-HD-L2, animejanaiV3-HD-L3}.onnx`, contributed by @hooke007 in #82.
- ani4k_v2_v1.7z: `RealESRGANv2/{Ani4Kv2_G6i2_Compact, Ani4Kv2_G6i2_UltraCompact}.onnx` from Ani4K v2, contributed by @srk24 in #105.
v13: fp16 i/o, faster dynamic shapes for TRT backend
- Added support for fp16 I/O format and faster dynamic shapes in the `TRT` backend.
  - Thanks to @hooke007, @grobalt, @MysteryDove, @chainikdn and many other users on the SVP forum, it has become clear that reducing system bandwidth requirements is crucial to 4K RIFE performance on powerful GPUs (https://github.com/AmusementClub/vs-mlrt/discussions/19). The `TRT` backend now accepts fp16 clips in addition to fp32, and the output format can be specified via the parameter `output_format` (0 for fp32 and 1 for fp16).

    As the only portable way to convert fp32 clips to fp16 is via `resize` (`std.Expr` only supports fp16 when the CPU supports the F16C instruction set extension), and to conserve memory bandwidth, you could use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and fp16 conversion in one go:

    ```python
    th = (src.height + 31) // 32 * 32  # adjust 32 and 31 to match specific AI network input resolution requirements.
    tw = (src.width + 31) // 32 * 32   # same.
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers.
    ow = src.width * (flt.width // tw)
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    ```
- Faster dynamic shapes introduced in TensorRT 8.5.1 improve performance and device memory usage of dynamically shaped engines (#20 (comment) and the following benchmark).

  Dynamically shaped models can be created by specifying `static_shape=False`, `min_shapes` (minimum size), `opt_shapes` (optimization size) and `max_shapes` (maximum size) in the `TRT` backend (see the sketch under "dynamic shapes" below).

  The engine cache placement policy of `vsmlrt.py` is direct-mapped:
  - If dynamic shapes are used, the engine name is determined by `min_shapes`, `opt_shapes` and `max_shapes` (among others).
  - Otherwise, the engine name is determined by `opt_shapes`.
  - `opt_shapes` is usually set to `tilesize` in each specific model's interface if not initialized.
- `workspace` can now be set to `None` for unlimited workspace size. (#21)
Add flag
force_fp16
. This flag forces fp16 computation during inference, and is disabled by default.- It reduces memory usage during engine build, and allows more engines to be built, e.g. 1080p rife on gpus with 4gb memory. (successfully build dynamically shaped rife with
min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176)
on a 4gb ampere) - It may reduce engine build time.
- It reduces memory usage during engine build, and allows more engines to be built, e.g. 1080p rife on gpus with 4gb memory. (successfully build dynamically shaped rife with
- Thanks to @hooke007 @grobalt @MysteryDove @chainikdn and many others users on the svp forum, it becomes clear that reducing system bandwidth requirement is crucial to 4K RIFE performance on powerful GPUs (https://github.com/AmusementClub/vs-mlrt/discussions/19). The
- Introduce a new simplified backend interface `BackendV2`.
- Disable tiling support for RIFE due to its incompatible inference design.
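A minimal sketch of the simplified interface; the exact factory name and keyword arguments of `BackendV2` shown here are assumptions modelled on the existing `Backend.TRT` parameters:

```python
import vsmlrt

# BackendV2 exposes the backends as simple factory functions (sketch only).
backend = vsmlrt.BackendV2.TRT(num_streams=2, fp16=True)
```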
dynamic shapes
In conclusion, dynamic shapes should be much more flexible when dealing with different video resolutions (no engine re-compilation is required), and incur almost no performance degradation starting with TensorRT 8.5; only increased device memory usage remains a concern.
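A minimal sketch of building one dynamically shaped engine covering everything from tiny clips up to 4K; the shape values are taken from the notes above, `force_fp16` and `workspace=None` are optional, and `padded` is assumed to be a mod-32 padded RGBH/RGBS clip as in the snippet above:

```python
import vsmlrt
from vsmlrt import Backend

# One engine serves every resolution between min_shapes and max_shapes,
# with tactics tuned for opt_shapes.
backend = Backend.TRT(
    fp16=True,
    force_fp16=True,          # lowers memory usage during the engine build
    workspace=None,           # unlimited workspace size
    static_shape=False,
    min_shapes=(64, 64),
    opt_shapes=(1920, 1088),
    max_shapes=(3840, 2176),
)
flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend)
```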
benchmark
- Configuration: NVIDIA A10 (ECC disabled), driver 527.41 (TCC), Windows Server 2022, Python 3.11.1, vapoursynth-classic R57.A7, `Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False)`, `CUDA_MODULE_LOADING=LAZY`
- Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) `static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>` or (2) `static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>`. `opt_shapes` may be lowered for faster engine generation.
- Measurements: FPS / device memory (MB)
model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2) |
---|---|---|---|---|---|---|
waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850 |
waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722 |
waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650 |
dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002 |
dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946 |
dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910 |
real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402 |
real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274 |
real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202 |
cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996 |
cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868 |
cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796 |
rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974 |
rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854 |
rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794 |
*The gap is large on rife because of underutilization, and will disappear when using more streams.
v12.3.test
This is a preview release for https://github.com/AmusementClub/vs-mlrt/releases/tag/v13.
v12.2
Update vsmlrt.py:
- Introduce a new release artifact `ext-models.v12.2.7z`, which comes from External Models and is not bundled into the full binary release packages (i.e. the `cpu`, `cuda` and `vk` packages). Please refer to their release notes for details on how to use those models.
- Export a new API `vsmlrt.inference` for inference of custom models.

  ```python
  import vsmlrt

  output = vsmlrt.inference(clips, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))
  ```

  If you encounter issues like `Cannot find input tensor with name "input" in the network inputs! Please make sure the input tensor names are correct.`, you could use `vsmlrt.inference(..., input_name=None)` or export the model with its input name set to "input".
- Fix `trt` inference of cugan-pro (3x) models. (#15)
External Models
More models!
In addition to bundled models, vs-mlrt can also be used to run these models:
- `anime-segmentation/isnet_is.onnx`: anime character segmentation at `a0a563c`, RGBS -> GRAYS, requires mod64 input
- `oidn/rt_ldr.onnx`: image denoising from the Intel® Open Image Denoise library, RGBS, requires mod16 input
- `ppocr/ml_PP-OCRv3_det.onnx`: multilingual text detection model from PaddleOCR, RGBS -> GRAYS, requires mod32 input
- waifu2x swin_unet: waifu2x's swin_unet models. They are supported by the Python wrapper with `vsmlrt.Waifu2xModel.{swin_unet_art, swin_unet_art_scan, swin_unet_photo{_v2}}`.
  - file list:
    - `waifu2x/swin_unet_art/{scale2x, scale4x, noise0, noise0_scale2x, ..., noise3_scale4x}.onnx`
    - `waifu2x/swin_unet_art_scan/{scale4x, noise0_scale4x, ..., noise3_scale4x}.onnx`
    - `waifu2x/swin_unet_photo{_v2}/{scale4x, noise0_scale4x, ..., noise3_scale4x}.onnx`
  - v2 models handle paddings internally and reduce PCIe traffic flow.
- `safa/safa_{v0.1,v0.2,v0.3,v0.4}_{non_adaptive,adaptive1x,adaptive}.onnx`: SAFA video enhancement models. Individually packaged.
- `ArtCNN/ArtCNN_{C4F32,C16F64,R16F96,R8F64}{_Chroma,_DS}`: ArtCNN models for anime super-resolution and restoration.
With more to come.
Also check onnx models provided by the avs-mlrt community.
Usage
If an external model is not supported by the Python wrapper, you can use the generic `vsmlrt.inference` API to run these models (requires release v12.2 or later).

```python
import vsmlrt

output = vsmlrt.inference(rgbs, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))
```
The rife model requires auxiliary inputs and should be used through the `vsmlrt.RIFE` or `vsmlrt.RIFEMerge` interfaces.
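For models with mod-size requirements (e.g. `isnet_is.onnx` above needs mod64 input), the same padding trick as the v13 RIFE snippet applies. A minimal sketch, assuming `src` is a YUV clip and the onnx path points at wherever the model was extracted:

```python
import vapoursynth as vs
import vsmlrt
from vsmlrt import Backend

# Pad to mod64 while converting to RGBS, run the segmentation model,
# then crop the GRAYS mask back to the source dimensions.
tw = (src.width + 63) // 64 * 64
th = (src.height + 63) // 64 * 64
padded = src.resize.Bicubic(tw, th, format=vs.RGBS, matrix_in_s="709",
                            src_width=tw, src_height=th)
mask = vsmlrt.inference(padded, "path/to/anime-segmentation/isnet_is.onnx",
                        backend=Backend.TRT(fp16=True))  # RGBS -> GRAYS
mask = mask.resize.Bicubic(src.width, src.height,
                           src_width=src.width, src_height=src.height)
```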
v12.1
This minor release fixes #9: now if vsort/vstrt fails to load required cuda DLLs, they won't crash the entire process.
However, if vs-mlrt is correctly installed, this shouldn't happen. Please report an issue if you can't access the `core.trt` or `core.ort` namespaces. A common mistake is forgetting to extract the `vsmlrt-cuda.v12.1.7z` package alongside the `VSORT-Windows-x64.v12.1.7z` or `VSTRT-Windows-x64.v12.1.7z` packages. If in doubt, CUDA users should use the fully bundled release `vsmlrt-windows-x64-cuda.v12.1.7z`.
Note: we explicitly do not support using both PyTorch and vs-mlrt plugins in the same vpy script, as PyTorch uses its own set of CUDA DLLs which might conflict with the ones vs-mlrt uses. As those DLLs are not explicitly versioned (e.g. `nvinfer.dll` instead of `nvinfer-x.yz.dll`), there is nothing we can do.