TensorRT
TensorRT is a set of runtime libraries that depend on the CUDA runtime.
The Python binding wheel can be installed into a custom environment with pip.
os="ubuntu1604"
tag="cuda11.0-trt7.2.2.3-ga-20201211"
sudo dpkg -i nv-tensorrt-repo-${os}-${tag}_1-1_amd64.deb
sudo apt-key add /var/nv-tensorrt-repo-${tag}/7fa2af80.pub
sudo apt-get update
# Installing the tensorrt meta-package pulls in everything, including CUDA 11, as dependencies:
# sudo apt-get install tensorrt
# TensorRT runtime
version="7.2.2-1+cuda11.0"
sudo apt-get install libnvinfer7=${version} libnvonnxparsers7=${version} libnvparsers7=${version} libnvinfer-plugin7=${version} libnvinfer-dev=${version} libnvonnxparsers-dev=${version} libnvparsers-dev=${version} libnvinfer-plugin-dev=${version} python-libnvinfer=${version} python3-libnvinfer=${version}
## TensorRT using cuda10.2
version="7.1.3-1+cuda10.2"
sudo apt-get install libnvinfer7=${version} libnvonnxparsers7=${version} libnvparsers7=${version} libnvinfer-plugin7=${version} libnvinfer-dev=${version} libnvonnxparsers-dev=${version} libnvparsers-dev=${version} libnvinfer-plugin-dev=${version} python-libnvinfer=${version} python3-libnvinfer=${version}
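A quick way to sanity-check the Python binding from the target environment, for example:
import tensorrt as trt

print(trt.__version__)                           # e.g. 7.2.2.3
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)                    # fails if the CUDA/TensorRT runtime is broken
print('Fast FP16 support:', builder.platform_has_fast_fp16)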
- CUDA Toolkit
  - official from NVIDIA: drivers, toolchains, and libraries
  - conda distributions:
    - cudatoolkit: runtime libraries
    - cudatoolkit-dev: compiler toolchains
- TensorRT
  - official distribution:
    - build tools (deb)
    - runtime libraries (deb)
    - python bindings (zip)
      - tensorrt-7.1.3.4-cp37-none-linux_x86_64.whl for custom env
- official distribution:
  os="ubuntu1604"
  tag="cuda10.2-trt7.1.3.4-ga-20200617"
  sudo dpkg -i nv-tensorrt-repo-${os}-${tag}_1-1_amd64.deb
  sudo apt-get update
  sudo apt-get install tensorrt libcudnn8 cuda-nvrtc-10-2
- CuDNN
  conda install cudnn=7.6.5
- PyCUDA
  pip install pycuda
- torch2trt (see the usage sketch after this list)
  pip install --install-option='--plugins' git+https://github.com/NVIDIA-AI-IOT/torch2trt.git@b0cc8e77a0fbd61e96b971a66bbc11326f77c6b5
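With the dependencies above in place, torch2trt converts a PyTorch module into a TensorRT-backed one roughly as follows; this is only a sketch, and the ResNet-18 model and input shape are assumptions:
import torch
from torch2trt import torch2trt
from torchvision.models import resnet18

# Assumed example model and input shape; substitute your own module.
model = resnet18(pretrained=True).eval().cuda()
x = torch.randn(1, 3, 224, 224).cuda()

# Convert; fp16_mode and max_workspace_size mirror the engine-building options below.
model_trt = torch2trt(model, [x], fp16_mode=True, max_workspace_size=1 << 28)

# The converted module is called like the original one.
print(torch.max(torch.abs(model(x) - model_trt(x))))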
Runtime considerations (see the builder sketch after this list):
- batch size
- workspace size
- mixed precision
- bounds on dynamic shapes
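These options are all set on the builder configuration when the engine is created. A minimal sketch with the TensorRT 7 Python API, assuming an ONNX model whose input binding is named 'input' and has a 3x224x224 image shape:
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, max_batch=8):
    # Explicit batch is required for ONNX parsing and dynamic shapes.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30            # workspace size: 1 GiB
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)      # mixed precision
    # Bounds on dynamic shapes: min/opt/max for the assumed input 'input'.
    profile = builder.create_optimization_profile()
    profile.set_shape('input', (1, 3, 224, 224),
                      (max_batch, 3, 224, 224), (max_batch, 3, 224, 224))
    config.add_optimization_profile(profile)
    return builder.build_engine(network, config)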
Optimizations:
- Elimination of layers whose outputs are not used
- Elimination of operations that are equivalent to no-ops
- Fusion of convolution, bias, and ReLU operations
- Aggregation of operations with sufficiently similar parameters and the same source tensor
- Merging of concatenation layers by directing layer outputs to the correct eventual destination
- Modification of weight precision where necessary, with calibration to determine the dynamic range of intermediate activations and hence the appropriate scaling factors for quantization
The optimized engine is specific to:
- target platform
- TensorRT version
- exact GPU model
- TensorRT allows you to increase the GPU memory footprint during the engine building phase with the setMaxWorkspaceSize function. Increasing this limit may reduce the number of applications that can share the GPU at the same time, while setting it too low may rule out several algorithms and yield a suboptimal engine.
- In general, CUDA programming streams are a way of organizing asynchronous work. Asynchronous commands put into a stream are guaranteed to run in sequence but may execute out of order with respect to other streams. In particular, asynchronous commands in two streams may be scheduled to run concurrently (subject to hardware limitations). In the context of TensorRT and inference, each layer of the optimized final network will require work on the GPU. However, not all layers will be able to fully utilize the computation capabilities of the hardware. Scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization. Even if only some layers can be overlapped, overall performance will improve.
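A minimal sketch of this pattern with the TensorRT 7 Python API and pycuda; the engine file name, the single 1x3x224x224 input, and the 1x1000 output are assumptions:
import numpy as np
import pycuda.autoinit                  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

# Serve two requests on two CUDA streams with separate execution contexts
# so that layer work from independent requests can overlap on the GPU.
logger = trt.Logger(trt.Logger.WARNING)
with open('model.engine', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

streams = [cuda.Stream() for _ in range(2)]
contexts = [engine.create_execution_context() for _ in range(2)]
h_in = [cuda.pagelocked_empty((1, 3, 224, 224), np.float32) for _ in range(2)]
h_out = [cuda.pagelocked_empty((1, 1000), np.float32) for _ in range(2)]
d_in = [cuda.mem_alloc(h.nbytes) for h in h_in]
d_out = [cuda.mem_alloc(h.nbytes) for h in h_out]

for i in range(2):
    cuda.memcpy_htod_async(d_in[i], h_in[i], streams[i])
    contexts[i].execute_async_v2(bindings=[int(d_in[i]), int(d_out[i])],
                                 stream_handle=streams[i].handle)
    cuda.memcpy_dtoh_async(h_out[i], d_out[i], streams[i])

for s in streams:
    s.synchronize()                     # wait for both streams to finish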
If a CUDNN_STATUS_MAPPING_ERROR occurs, try initializing CUDA with pycuda before PyTorch:
import pycuda.autoinit
Alternatively, manage host/CUDA memory with PyTorch only, as shown by torch2trt.
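For instance, a script that mixes both libraries could order its imports as follows; this is just a sketch of the workaround above:
import pycuda.autoinit            # let pycuda initialize CUDA first
import torch                      # import PyTorch afterwards

x = torch.ones(1, device='cuda')  # GPU work happens only after the pycuda init
print(x)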