Releases: AmusementClub/vs-mlrt
v14.test2: latest TensorRT library
This is a preview release for TensorRT 9.1.0, following the v14.test release.
- Same as the v14.test release, it requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- TensorRT 9.1.0 is officially documented as "for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only" on Linux. The Windows build is downloaded from here, and can be used on other GPU models.
  - On Windows, some users have reported crashes when using it in mpv (#65). This problem occurred in an earlier build of this release and has since been fixed.
- Add parameters `bf16` (#64), `custom_env` and `custom_args` to the `TRT` backend.
  - fp16 execution of `Waifu2xModel.swin_unet_art` is more accurate, faster and uses less GPU memory than bf16 execution (benchmark).
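  - A minimal sketch of constructing the backend with the new parameters; only the parameter names come from this release, while treating `custom_env` as extra environment variables and `custom_args` as extra engine-building arguments is an assumption for illustration:

    ```python
    import vsmlrt
    from vsmlrt import Backend

    # bf16 toggles bfloat16 execution; per the note above, fp16 is usually
    # preferable to bf16 for Waifu2xModel.swin_unet_art.
    backend = Backend.TRT(
        fp16=True,
        bf16=False,
        custom_env={"CUDA_MODULE_LOADING": "LAZY"},  # assumed: extra environment variables
        custom_args=[],                              # assumed: extra engine-building arguments
    )
    ```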
- Device memory usage of model `Waifu2xModel.swin_unet_art` is reduced compared to TensorRT 9.0.1 on A10G with 1080p input (at 2.66 fps with 7.0GB VRAM usage) with the default auxiliary stream heuristic.
  - TensorRT 9.0.1 uses 7 auxiliary streams compared to 3 streams for TensorRT 9.1.0, which results in significantly higher device memory usage with no performance gain.
  - Setting `max_aux_streams=3` lowers the device memory usage of TensorRT 9.0.1 to ~8.9GB, and `max_aux_streams=0` corresponds to ~7.3GB usage.
  - TensorRT 9.1.0 with `max_aux_streams=0` uses ~6.7GB device memory.
- Users should use the same version of TensorRT as provided (9.1.0) because runtime version checking is disabled in this release.
- Added support for RIFE v4.8 - v4.12 and v4.12 ~ v4.13 lite (ensemble) models, which are also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`). v4.8 and v4.9 models should have the same execution speed as v4.7, while v4.10 - v4.12 models are heavier than previous models. Ensemble models are heavier than their non-ensemble counterparts.
  - Starting from RIFE v4.11, all RIFE models are temporarily moved here with individual packaging.
- RIFE models with the v2 representation for the `TRT` backend now have improved accuracy, contributed by @charlessuh (#66 (comment)). This has been backported to master.
  - This improvement may be very slightly inefficient under onnx file renaming. It is advised to keep the onnx file name unchanged and change the function call to `vsmlrt.RIFE()` instead.
  - By default, `vsmlrt.RIFE()` in `vsmlrt.py` uses the v1 representation. The v2 representation is enabled with the `vsmlrt.RIFE(_implementation=2)` function call, as in the sketch below.
  - Sample error message:

    ```
    input: for dimension number 1 in profile 0 does not match network definition (got min=11, opt=11, max=11), expected min=opt=max=7)
    ```
  - The v2 representation is still considered experimental.
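  - A minimal sketch of selecting the v2 representation through the wrapper; `_implementation` is the parameter named above, and `padded` is assumed to be a mod-32 padded RGBS/RGBH clip as in the v13 RIFE snippet further down:

    ```python
    import vsmlrt
    from vsmlrt import Backend

    # Keep the shipped onnx file names unchanged and pick the v2
    # representation via the function call instead.
    flt = vsmlrt.RIFE(
        padded,
        model=vsmlrt.RIFEModel.v4_6,
        backend=Backend.TRT(fp16=True),
        _implementation=2,  # v2 representation (still experimental)
    )
    ```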
- Added support for the SAFA v0.1 video enhancement model.
  - This model takes arbitrarily sized video and uses both spatial and temporal information to improve visual quality.
  - Note that this model is non-deterministic by nature, and existing backends do not support manual seeding.
  - ~17 fps on RTX 4090 with `TRT(fp16)`, 1080p input and non-adaptive mode. Adaptive mode is about 2x slower than non-adaptive mode, uses more memory and does not support CUDA graphs execution.
  - This model is not supported by the `NCNN_VK` backend due to the same issue as the RIFE v2 representation.
- Also check the release notes of the v14.test release.
  - This pre-release uses trt 9.1.0 + cuda 12.2.2 + cudnn 8.9.5, which can only run on driver >= 525 and 10 series and later GPUs, with improved support for the self-attentions found in transformer models. `vsmlrt.py` in all branches can be used interchangeably.
  - TensorRT 9.0.1 is "for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only" on x86 Linux. Model `Waifu2xModel.swin_unet_art` is 1.2x faster compared to TensorRT 8.6.1 on A10G with 720p input (at 6.3 fps with 4GB VRAM usage), thanks to multi-head attention fusion (requires fp16).
This pre-release is now feature complete. Development has now switched to the v14.test3 pre-release.
v14.test: latest TensorRT library
This is a preview release for TensorRT 8.6.1.
- It requires Pascal GPUs or later (10 series+) and driver version >= 525. Support for Kepler 2.0 and Maxwell GPUs is dropped.
- Add parameters `builder_optimization_level` and `max_aux_streams` to the `TRT` backend (see the sketch after this list).
  - `builder_optimization_level`: "adjust how long TensorRT should spend searching for tactics with potentially better performance" (link)
  - `max_aux_streams`: within-inference multi-streaming, "if enabled, TensorRT will run some layers on the auxiliary streams in parallel to the layers running on the main stream, ..., may increase the memory consumption, ..." (link)
  - It is advised to lower `max_aux_streams` to 0 on heavy models like `Waifu2xModel.swin_unet_art` to reduce memory usage. Check the benchmark data at the bottom.
- Following TensorRT 8.6.1, the `cudnn` tactic source of the `TRT` backend is disabled by default. `tf32` is also disabled by default in vsmlrt.py.
- Add parameter `short_path` to the `TRT` backend, which shortens the engine path and is enabled on Windows by default.
- Model `Waifu2xModel.swin_unet_art` does not seem to work with `builder_optimization_level=5` in the `TRT` backend before TRT 9.0. Use `builder_optimization_level=4` or lower instead.
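A minimal sketch combining the two new parameters with the guidance above; `rgbs` is assumed to be an RGBS input clip and the values are examples only:

```python
import vsmlrt
from vsmlrt import Backend

# Heavy transformer model: keep the optimization level below 5 (see the note
# above) and disable auxiliary streams to reduce device memory usage.
backend = Backend.TRT(
    fp16=True,
    builder_optimization_level=4,
    max_aux_streams=0,
)
flt = vsmlrt.Waifu2x(rgbs, model=vsmlrt.Waifu2xModel.swin_unet_art, backend=backend)
```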
Less than 5% performance improvement is observed among built-in models compared to v13.1/v13.2, with a 24% device memory usage reduction on DPIR and 35% on RealESRGAN.
Version information:
- The v13.2 release uses trt 8.5.1 + cuda 11.8.0, which can run on driver >= 450 and 900 series and later GPUs.
- The v14.test pre-release uses trt 8.6.1 + cuda 12.1.1, which can only run on driver >= 525 and 10 series and later GPUs, with no significant performance improvement measured.
- `vsmlrt.py` in both branches can be used interchangeably.
- Added support for the RIFE v4.7 model ("optimized for anime scenes"), which is also available for previous vs-mlrt releases (simply download the new model file here and update `vsmlrt.py`). It is more computationally intensive than v4.6.
- This pre-release is now feature complete. Development has now switched to the trt-latest branch and the v14.test2 pre-release.
v13.2: latest ort library, DirectML backend
- Added support for the DirectML backend through ONNX Runtime. It is available for all DX12 devices and can be accessed through `backend=Backend.ORT_DML()` (see the sketch below). waifu2x swin_unet models may not be supported on this backend, and RIFE models may be poorly supported.
- Asset `vsmlrt-windows-x64-vk.*.7z` is renamed to `vsmlrt-windows-x64-generic-gpu.*.7z` and includes backends `OV_CPU`, `OV_GPU`, `ORT_CPU`, `ORT_DML` and `NCNN_VK`. The `cuda` asset continues to include all backends in this release.
- Update onnxruntime to microsoft/onnxruntime@73584f9.
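For example, a minimal sketch of running DPIR through the new backend; the DPIR call follows the wrapper's usual interface, `rgbs` is an assumed RGBS input clip and the `strength` value is arbitrary:

```python
import vsmlrt
from vsmlrt import Backend

# DirectML runs on any DX12 device; note the caveats above about
# waifu2x swin_unet and RIFE models on this backend.
flt = vsmlrt.DPIR(
    rgbs,
    strength=5,
    model=vsmlrt.DPIRModel.drunet_color,
    backend=Backend.ORT_DML(),
)
```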
Note
- Backend `OV_GPU` may produce reduced precision output. This is under investigation.
benchmark 1
NVIDIA GeForce RTX 3090, 10496 shaders @ 1695 MHz, driver 532.03, Windows 10 21H2 LTSC (19044.1415), Python 3.11.3, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | ORT_CUDA | ORT_DML |
---|---|---|
dpir | 4.25 / 2573.3 | 7.01 / 2371.0 |
dpir (2 streams) | 4.58 / 5506.2 | 8.85 / 4643.1 |
waifu2x upconv7 | 9.10 / 5248.1 | 9.65 / 4503.1 |
waifu2x upconv7 (2 streams) | 11.15 / 2966.9 | 18.52 / 8911.2 |
waifu2x cunet / cugan | 4.06 / 7875.7 | 6.36 / 8973.7 |
waifu2x cunet / cugan (2 streams) | N/A | 9.51 / 17849.1 |
waifu2x swin_unet_art | N/A | N/A |
realesrgan | 7.52 / 1901.7 | 8.54 / 1352.4 |
realesrgan (2 streams) | 11.15 / 2966.9 | 15.58 / 2608.7 |
rife | 34.30 / 1109.1 | 2.12 / 1417.8 |
rife (2 streams) | 61.45 / 2051.4 | 4.27 / 2740.9 |
benchmark 2
AMD Radeon Pro V620 MxGPU, 4608 shaders @ 2390 MHz, Adrenalin 21.20.02.13, Windows Server 2019, Python 3.11.3, vapoursynth-classic R57.A8
Measurements: (1080p, fp16) FPS / Device Memory (MB)
model | NCNN_VK | ORT_DML |
---|---|---|
dpir | 1.70 / 3248.4 | 4.75 / 2308.1 |
dpir (2 streams) | 1.74 / 6099.5 | 4.86 / 4584.6 |
waifu2x upconv7 | 5.18 / 6872.3 | 14.51 / 4448.5 |
waifu2x upconv7 (2 streams) | 6.14 / 13701 | 15.98 / 8861.2 |
waifu2x cunet / cugan (2x2 tiles) | 1.07 / 3159.8 | 5.57 / 2196.7 |
waifu2x cunet / cugan (2x2 tiles, 2 streams) | 1.07 / 3159.8 | 6.08 / 4357.8 |
waifu2x swin_unet_art | N/A | N/A |
realesrgan | 3.86 / 2699.7 | 9.59 / 1290.4 |
realesrgan (2 streams) | 4.43 / 5355.8 | 10.58 / 2545.3 |
rife | N/A | 2.68 / 1353.5 |
rife (2 streams) | N/A | 4.44 / 2673.3 |
v13.1: latest ov & ort library, new models
- Update openvino to openvinotoolkit/openvino@b0ffec4, with improved support for RIFE in both `OV_CPU` and `OV_GPU`. (benchmark on arc a380)
- Fix a typo in vsmlrt.py's `__all__`.
- Added parameter `num_threads` to the `OV_CPU` backend (see the sketch after this list).
- Update onnxruntime to microsoft/onnxruntime@8ed3dfe.
- The default value of `workspace` of the TRT backend will be changed to `None` in the next release, and `use_cudnn` will be changed to `False`.
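A minimal sketch of the new parameter; the thread count is only an example, and the CUGAN call is an assumed illustration of passing the backend to a wrapper function (`rgbs` is an assumed RGBS clip):

```python
import vsmlrt
from vsmlrt import Backend

# Limit OpenVINO CPU inference to 8 threads via the new num_threads parameter.
backend = Backend.OV_CPU(num_threads=8)
flt = vsmlrt.CUGAN(rgbs, noise=-1, scale=2, backend=backend)
```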
The following models can be found in External Models:
- Add support for waifu2x `swin_unet` models.
- Add support for the `ensemble` configuration of RIFE. (RIFE v2 acceleration is experimental and may result in reduced quality on the `TRT` backend with fp16 enabled.)
Note
- Backend `OV_GPU` may produce reduced precision output. This is under investigation.
Contributed Models
Please see PR #42 for policy.
- AnimeJaNai_v2.7z: `RealESRGANv2/{animejanaiV2L1, animejanaiV2L2, animejanaiV2L3}.onnx`, contributed by @hooke007 in #53.
- AnimeJaNai_v3.7z: `RealESRGANv2/{animejanaiV3-HD-L1, animejanaiV3-HD-L2, animejanaiV3-HD-L3}.onnx`, contributed by @hooke007 in #82.
- ani4k_v2_v1.7z: `RealESRGANv2/{Ani4Kv2_G6i2_Compact, Ani4Kv2_G6i2_UltraCompact}.onnx` from Ani4K v2, contributed by @srk24 in #105.
v13: fp16 i/o, faster dynamic shapes for TRT backend
- Added support for fp16 I/O format and faster dynamic shapes in the `TRT` backend.
  - Thanks to @hooke007, @grobalt, @MysteryDove, @chainikdn and many other users on the SVP forum, it has become clear that reducing system bandwidth requirements is crucial to 4K RIFE performance on powerful GPUs (https://github.com/AmusementClub/vs-mlrt/discussions/19). The `TRT` backend now accepts fp16 clips in addition to fp32, and the output format can be specified via the parameter `output_format` (0 for fp32 and 1 for fp16).

    As the only portable way to convert fp32 clips to fp16 is via `resize` (`std.Expr` only supports fp16 when the CPU supports the F16C instruction set extension), and to conserve memory bandwidth, you could use the following snippet for RIFE to perform the necessary padding, YUV/RGB conversion and fp16 conversion in one go:

    ```python
    th = (src.height + 31) // 32 * 32  # adjust 32 and 31 to match specific AI network input resolution requirements.
    tw = (src.width + 31) // 32 * 32   # same.
    padded = src.resize.Bicubic(tw, th, format=vs.RGBS if WANT_FP32 else vs.RGBH, matrix_in_s="709", src_width=tw, src_height=th)
    flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend, output_format=1)  # fp16 output
    oh = src.height * (flt.height // th)  # not necessary for RIFE (i.e. oh = src.height), but required for super-resolution upscalers.
    ow = src.width * (flt.width // tw)
    res = flt.resize.Bicubic(ow, oh, format=vs.YUV420P8, matrix_s="709", src_width=ow, src_height=oh)
    ```
- Faster dynamic shapes introduced in TensorRT 8.5.1 improve performance and device memory usage of dynamically shaped engines (#20 (comment) and the following benchmark).

  Dynamically shaped models can be created by specifying `static_shape=False`, `min_shapes` (minimum size), `opt_shapes` (optimization size) and `max_shapes` (maximum size) in the `TRT` backend (see the sketch under "dynamic shapes" below).

  The engine cache placement policy of `vsmlrt.py` is direct-mapped:
  - If dynamic shapes are used, the engine name is determined by `min_shapes`, `opt_shapes` and `max_shapes` (among others).
  - Otherwise, the engine name is determined by `opt_shapes`.
  - `opt_shapes` is usually set to `tilesize` in each specific model's interface if not initialized.
- `workspace` can now be set to `None` for unlimited workspace size. (#21)
Add flag
force_fp16
. This flag forces fp16 computation during inference, and is disabled by default.- It reduces memory usage during engine build, and allows more engines to be built, e.g. 1080p rife on gpus with 4gb memory. (successfully build dynamically shaped rife with
min_shapes=(64, 64), opt_shapes=(1920, 1088), max_shapes=(3840, 2176)
on a 4gb ampere) - It may reduce engine build time.
- It reduces memory usage during engine build, and allows more engines to be built, e.g. 1080p rife on gpus with 4gb memory. (successfully build dynamically shaped rife with
- Thanks to @hooke007 @grobalt @MysteryDove @chainikdn and many others users on the svp forum, it becomes clear that reducing system bandwidth requirement is crucial to 4K RIFE performance on powerful GPUs (https://github.com/AmusementClub/vs-mlrt/discussions/19). The
- Introduce a new simplified backend interface `BackendV2`.
- Disable tiling support for RIFE due to its incompatible inference design.
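A minimal sketch of the simplified interface; the exact factory name and keyword arguments of `BackendV2` shown here are assumptions modelled on the existing `Backend.TRT` parameters:

```python
import vsmlrt

# BackendV2 exposes the backends as simple factory functions (sketch only).
backend = vsmlrt.BackendV2.TRT(num_streams=2, fp16=True)
```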
dynamic shapes
In conclusion, dynamic shapes should be much more flexible when dealing with different video resolutions (no engine re-compilation is required), and incur almost no performance degradation starting with TensorRT 8.5; only increased device memory usage remains a concern.
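A minimal sketch of building one dynamically shaped engine covering everything from tiny clips up to 4K; the shape values are taken from the notes above, `force_fp16` and `workspace=None` are optional, and `padded` is assumed to be a mod-32 padded RGBH/RGBS clip as in the snippet above:

```python
import vsmlrt
from vsmlrt import Backend

# One engine serves every resolution between min_shapes and max_shapes,
# with tactics tuned for opt_shapes.
backend = Backend.TRT(
    fp16=True,
    force_fp16=True,          # lowers memory usage during the engine build
    workspace=None,           # unlimited workspace size
    static_shape=False,
    min_shapes=(64, 64),
    opt_shapes=(1920, 1088),
    max_shapes=(3840, 2176),
)
flt = vsmlrt.RIFE(padded, model=vsmlrt.RIFEModel.v4_6, backend=backend)
```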
benchmark
- Configuration: NVIDIA A10 (ECC disabled), driver 527.41 (TCC), Windows Server 2022, Python 3.11.1, vapoursynth-classic R57.A7, `Backend.TRT(fp16=True, use_cuda_graph=True, tf32=False, use_cudnn=False)`, `CUDA_MODULE_LOADING=LAZY`
- Statically shaped engines for each model are compiled separately, while the dynamically shaped engine is compiled once with (1) `static_shape=False, min_shapes=(64, 64), opt_shapes=<max-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>` or (2) `static_shape=False, min_shapes=(64, 64), opt_shapes=<min-benchmarked-video-dimensions>, max_shapes=<max-benchmarked-video-dimensions>`. `opt_shapes` may be lowered for faster engine generation.
- Measurements: FPS / device memory (MB)
model | 1 stream static | 1 stream dynamic (1) | 1 stream dynamic (2) | 2 streams static | 2 streams dynamic (1) | 2 streams dynamic (2) |
---|---|---|---|---|---|---|
waifu2x upconv7 1920x1080 | 17.4 / 1992 | 17.5 / 1998 | 17.4 / 2040 | 21.2 / 3694 | 21.2 / 3756 | 20.9 / 3850 |
waifu2x upconv7 1280x720 | 37.2 / 1046 | 38.5 / 1930 | 37.8 / 1976 | 46.3 / 1818 | 48.2 / 3628 | 46.5 / 3722 |
waifu2x upconv7 720x480 | 102.2 / 544 | 104.4 / 1894 | 102.2 / 1940 | 123.1 / 834 | 128.1 / 3556 | 123.4 / 3650 |
dpir color 1920x1080 | 7.3 / 2114 | 7.3 / 2114 | 7.2 / 2116 | 7.5 / 3656 | 7.4 / 3992 | 7.4 / 4002 |
dpir color 1280x720 | 16.4 / 1122 | 16.3 / 2086 | 16.2 / 2092 | 16.7 / 1810 | 16.7 / 3936 | 16.7 / 3946 |
dpir color 720x480 | 41.5 / 604 | 41.5 / 2068 | 41.6 / 2074 | 44.3 / 863 | 44.3 / 3900 | 44.2 / 3910 |
real-esrgan v2 1920x1080 | 12.3 / 1320 | 12.3 / 1320 | 12.3 / 1320 | 13.0 / 2196 | 13.2 / 2402 | 12.9 / 2402 |
real-esrgan v2 1280x720 | 26.9 / 736 | 26.9 / 1256 | 27.0 / 1256 | 29.1 / 1130 | 29.3 / 2274 | 29.2 / 2274 |
real-esrgan v2 720x480 | 73.2 / 422 | 73.2 / 1220 | 72.7 / 1220 | 78.7 / 570 | 78.4 / 2202 | 78.1 / 2202 |
cugan 1920x1080 | 9.4 / 4648 | 9.4 / 4726 | 9.2 / 4618 | 9.8 / 8754 | 10.2 / 9210 | 9.9 / 8996 |
cugan 1280x720 | 20.5 / 2214 | 20.5 / 4662 | 20.0 / 4554 | 21.2 / 4050 | 22.9 / 9082 | 22.4 / 8868 |
cugan 720x480 | 54.8 / 996 | 53.7 / 4626 | 52.9 / 4518 | 57.7 / 1690 | 59.9 / 9019 | 58.9 / 8796 |
rife v4.4 1920x1088 | 92.8 / 590 | 92.2 / 594 | 89.5 / 606 | 178.6 / 920 | 177.6 / 942 | 169.5 / 974 |
rife v4.4 1280x736 | 206.0 / 410 | 199.2 / 534 | 199.1 / 550 | 394.2 / 560 | 377.5 / 822 | 374.0 / 854 |
rife v4.4 736x480 | 497.3 / 316 | 442.2 / 504 | 492.3 / 520 | 903.2 / 376 | 809.3 / 762 | 874.1 / 794 |
*The gap is large on rife because of underutilization, and will disappear when using more streams.
v12.3.test
This is a preview release for https://github.com/AmusementClub/vs-mlrt/releases/tag/v13.
v12.2
Update vsmlrt.py:
- Introduce a new release artifact `ext-models.v12.2.7z`, which comes from External Models and is not bundled into the full binary release packages (i.e. the `cpu`, `cuda` and `vk` packages). Please refer to their release notes for details on how to use those models.
- Export a new API `vsmlrt.inference` for inference of custom models.

  ```python
  import vsmlrt

  output = vsmlrt.inference(clips, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))
  ```

  If you encounter issues like `Cannot find input tensor with name "input" in the network inputs! Please make sure the input tensor names are correct.`, you could use `vsmlrt.inference(..., input_name=None)` or export the model with its input name set to "input".
- Fix `trt` inference of cugan-pro (3x) models. (#15)
External Models
More models!
In addition to bundled models, vs-mlrt can also be used to run these models:
- `anime-segmentation/isnet_is.onnx`: anime character segmentation at `a0a563c`, RGBS -> GRAYS, requires mod64 input
- `oidn/rt_ldr.onnx`: image denoising from the Intel® Open Image Denoise library, RGBS, requires mod16 input
- `ppocr/ml_PP-OCRv3_det.onnx`: multilingual text detection model from PaddleOCR, RGBS -> GRAYS, requires mod32 input
- waifu2x swin_unet: waifu2x's swin_unet models. They are supported by the Python wrapper with `vsmlrt.Waifu2xModel.{swin_unet_art, swin_unet_art_scan, swin_unet_photo{_v2}}`.
  - file list:
    - `waifu2x/swin_unet_art/{scale2x, scale4x, noise0, noise0_scale2x, ..., noise3_scale4x}.onnx`
    - `waifu2x/swin_unet_art_scan/{scale4x, noise0_scale4x, ..., noise3_scale4x}.onnx`
    - `waifu2x/swin_unet_photo{_v2}/{scale4x, noise0_scale4x, ..., noise3_scale4x}.onnx`
  - v2 models handle paddings internally and reduce PCIe traffic flow.
- `safa/safa_{v0.1,v0.2,v0.3,v0.4}_{non_adaptive,adaptive1x,adaptive}.onnx`: SAFA video enhancement models. Individually packaged.
- `ArtCNN/ArtCNN_{C4F32,C16F64,R16F96,R8F64}{_Chroma,_DS}`: ArtCNN models for anime super-resolution and restoration.
With more to come.
Also check onnx models provided by the avs-mlrt community.
Usage
If an external model is not supported by the Python wrapper, you can use the generic `vsmlrt.inference` API to run these models (requires release v12.2 or later).

```python
import vsmlrt

output = vsmlrt.inference(rgbs, "path/to/onnx", backend=vsmlrt.Backend.TRT(fp16=True))
```
The rife model requires auxiliary inputs and should be used through the `vsmlrt.RIFE` or `vsmlrt.RIFEMerge` interfaces.
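For models with mod-size requirements (e.g. `isnet_is.onnx` above needs mod64 input), the same padding trick as the v13 RIFE snippet applies. A minimal sketch, assuming `src` is a YUV clip and the onnx path points at wherever the model was extracted:

```python
import vapoursynth as vs
import vsmlrt
from vsmlrt import Backend

# Pad to mod64 while converting to RGBS, run the segmentation model,
# then crop the GRAYS mask back to the source dimensions.
tw = (src.width + 63) // 64 * 64
th = (src.height + 63) // 64 * 64
padded = src.resize.Bicubic(tw, th, format=vs.RGBS, matrix_in_s="709",
                            src_width=tw, src_height=th)
mask = vsmlrt.inference(padded, "path/to/anime-segmentation/isnet_is.onnx",
                        backend=Backend.TRT(fp16=True))  # RGBS -> GRAYS
mask = mask.resize.Bicubic(src.width, src.height,
                           src_width=src.width, src_height=src.height)
```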
v12.1
This minor release fixes #9: now if vsort/vstrt fails to load required cuda DLLs, they won't crash the entire process.
However, if vs-mlrt is correctly installed, this shouldn't happen. Please report an issue if you can't access the `core.trt` or `core.ort` namespaces. A common mistake is forgetting to extract the `vsmlrt-cuda.v12.1.7z` package alongside the `VSORT-Windows-x64.v12.1.7z` or `VSTRT-Windows-x64.v12.1.7z` packages. If in doubt, CUDA users should use the fully bundled release `vsmlrt-windows-x64-cuda.v12.1.7z`.
Note: we explicitly do not support using both PyTorch and vs-mlrt plugins in the same vpy script, as PyTorch uses its own set of CUDA DLLs which might conflict with the ones vs-mlrt uses. As those DLLs are not explicitly versioned (e.g. `nvinfer.dll` instead of `nvinfer-x.yz.dll`), there is nothing we can do.