
BertQA sample throws segmentation fault (TensorRT 10.3) when running GPU Jetson Orin Nano #4220

Open
krishnarajk opened this issue Oct 23, 2024 · 8 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@krishnarajk

krishnarajk commented Oct 23, 2024

Description

I tried running the BERT QA sample on a Jetson Orin Nano with JetPack 6.1.
I used BERT Base, because BERT Large gets killed while building the engine (possibly due to a memory issue).

[10/23/2024-13:27:53] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +7, GPU +67, now: CPU 2160, GPU 6001 (MiB)
[10/23/2024-13:27:53] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/23/2024-13:28:39] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[10/23/2024-13:28:42] [TRT] [I] Total Host Persistent Memory: 316288
[10/23/2024-13:28:42] [TRT] [I] Total Device Persistent Memory: 110592
[10/23/2024-13:28:42] [TRT] [I] Total Scratch Memory: 0
[10/23/2024-13:28:42] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 164 steps to complete.
[10/23/2024-13:28:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 3.28999ms to assign 5 blocks to 164 nodes requiring 1378304 bytes.
[10/23/2024-13:28:43] [TRT] [I] Total Activation Memory: 1378304
[10/23/2024-13:28:43] [TRT] [I] Total Weights Memory: 170059792
[10/23/2024-13:28:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU -1, now: CPU 2372, GPU 6707 (MiB)
[10/23/2024-13:28:43] [TRT] [I] Engine generation completed in 51.1302 seconds.
[10/23/2024-13:28:43] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 4 MiB, GPU 384 MiB
[10/23/2024-13:28:43] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 3087 MiB
[10/23/2024-13:28:43] [TRT] [I] build engine in 52.969 Sec
[10/23/2024-13:28:44] [TRT] [I] Saving Engine to engines/bert_base_128.engine
[10/23/2024-13:28:44] [TRT] [I] Done.

Then I used inference.py with the same sample given in the examples:
python3 inference.py -e engines/bert_base_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/vocab.txt
It throws a segmentation fault:
[10/23/2024-13:30:07] [TRT] [I] Loaded engine size: 208 MiB
[10/23/2024-13:30:08] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +8, GPU +70, now: CPU 317, GPU 4590 (MiB)
[10/23/2024-13:30:08] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +7, GPU +64, now: CPU 109, GPU 4379 (MiB)
[10/23/2024-13:30:08] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 0, GPU 163 (MiB)

Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.

Question: What is TensorRT?
Segmentation fault (core dumped)
** https://github.com/NVIDIA/TensorRT/tree/release/10.3/demo/BERT#model-overview
** I don't use the OSS container; I installed these packages on the device:
[Image attachment]

Please help me out here.

Environment

TensorRT Version: 10.3

NVIDIA GPU: Ampere (Jetson Orin Nano)

NVIDIA Driver Version: JetPack 6.1

CUDA Version: 12.6

CUDNN Version:

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.10

@krishnarajk krishnarajk changed the title BertQA sample throws segmentation fault on TensorRT 10.3 when running GPU Jetson Orin Nano BertQA sample throws segmentation fault (TensorRT 10.3) when running GPU Jetson Orin Nano Oct 23, 2024
@lix19937

You can use tegrastats to watch the RAM usage while trtexec loads the engine and runs inference.
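
For example, a small Python sketch (purely illustrative, not part of the BERT demo) that launches tegrastats and prints just the RAM figure while the inference script runs in another shell; the regex follows the tegrastats lines pasted later in this thread, and the interval value is arbitrary.

import re
import subprocess

# Start tegrastats (ships with JetPack); --interval is in milliseconds.
proc = subprocess.Popen(
    ["tegrastats", "--interval", "1000"],
    stdout=subprocess.PIPE, text=True,
)
try:
    for line in proc.stdout:
        # Lines look like: "RAM 4518/7620MB (lfb 1x1MB) CPU [...] ..."
        m = re.search(r"RAM (\d+)/(\d+)MB", line)
        if m:
            used, total = map(int, m.groups())
            print(f"RAM {used}/{total} MB ({100 * used / total:.0f}% used)")
finally:
    proc.terminate()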

@krishnarajk

krishnarajk commented Oct 24, 2024

This is the RAM usage when I run the inference.
10-24-2024 14:19:16 RAM 4518/7620MB (lfb 1x1MB) CPU [8%@729,21%@729,10%@729,8%@729,100%@1510,10%@1510] GR3D_FREQ 76% [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] VDD_IN 5291mW/4526mW VDD_CPU_GPU_CV 1500mW/997mW VDD_SOC 1500mW/1358mW

10-24-2024 14:19:17 RAM 4516/7620MB (lfb 1x4MB) CPU [6%@1510,6%@1510,5%@1510,6%@1510,100%@1510,8%@1510] GR3D_FREQ 99% [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] VDD_IN 4864mW/4538mW VDD_CPU_GPU_CV 1263mW/1006mW VDD_SOC 1461mW/1361mW

10-24-2024 14:19:18 RAM 4517/7620MB (lfb 1x4MB) CPU [11%@729,12%@729,17%@729,9%@729,99%@1510,3%@1510] GR3D_FREQ 0% [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] VDD_IN 4746mW/4545mW VDD_CPU_GPU_CV 1145mW/1011mW VDD_SOC 1421mW/1363mW
Could a shortage of memory be causing the segmentation fault in this case?

@krishnarajk

I also tried increasing the swap memory by 4 GB.

@lix19937

Can you try using trtexec to load the engine and run inference?

@krishnarajk

krishnarajk commented Oct 27, 2024

How can I do that? I am new to TensorRT and was trying to run the sample application. I have TensorRT installed on my container, but it shows:

trtexec --help
bash: trtexec: command not found

dpkg -l|grep -i tensorrt
ii  libnvinfer-dev                       10.3.0.26-1+cuda12.5                    arm64        TensorRT development libraries
ii  libnvinfer-dispatch-dev              10.3.0.26-1+cuda12.5                    arm64        TensorRT development dispatch runtime libraries
ii  libnvinfer-dispatch10                10.3.0.26-1+cuda12.5                    arm64        TensorRT dispatch runtime library
ii  libnvinfer-headers-dev               10.3.0.26-1+cuda12.5                    arm64        TensorRT development headers
ii  libnvinfer-headers-plugin-dev        10.3.0.26-1+cuda12.5                    arm64        TensorRT plugin headers
ii  libnvinfer-lean-dev                  10.3.0.26-1+cuda12.5                    arm64        TensorRT lean runtime libraries
ii  libnvinfer-lean10                    10.3.0.26-1+cuda12.5                    arm64        TensorRT lean runtime library
ii  libnvinfer-plugin-dev                10.3.0.26-1+cuda12.5                    arm64        TensorRT plugin libraries
ii  libnvinfer-plugin10                  10.3.0.26-1+cuda12.5                    arm64        TensorRT plugin libraries
ii  libnvinfer-vc-plugin-dev             10.3.0.26-1+cuda12.5                    arm64        TensorRT vc-plugin library
ii  libnvinfer-vc-plugin10               10.3.0.26-1+cuda12.5                    arm64        TensorRT vc-plugin library
ii  libnvinfer10                         10.3.0.26-1+cuda12.5                    arm64        TensorRT runtime libraries
ii  libnvonnxparsers-dev                 10.3.0.26-1+cuda12.5                    arm64        TensorRT ONNX libraries
ii  libnvonnxparsers10                   10.3.0.26-1+cuda12.5                    arm64        TensorRT ONNX libraries
ii  python3-libnvinfer                   10.3.0.26-1+cuda12.5                    arm64        Python 3 bindings for TensorRT standard runtime

@lix19937

trtexec --onnx=your_onnx_file --verbose @krishnarajk

@krishnarajk

krishnarajk commented Oct 28, 2024

This is for loading the ONNX model, right? How do I run inference with an engine using trtexec?

I tried
./trtexec --loadEngine=/TensorRT/demo/BERT/engines/bert_base_128.engine --verbose

and got this log:

[10/28/2024-21:14:18] [I] === Performance summary ===
[10/28/2024-21:14:18] [I] Throughput: 171.08 qps
[10/28/2024-21:14:18] [I] Latency: min = 5.69727 ms, max = 10.7233 ms, mean = 5.87895 ms, median = 5.71912 ms, percentile(90%) = 5.73627 ms, percentile(95%) = 7.51221 ms, percentile(99%) = 8.71997 ms
[10/28/2024-21:14:18] [I] Enqueue Time: min = 0.673218 ms, max = 1.84448 ms, mean = 1.21463 ms, median = 1.21777 ms, percentile(90%) = 1.3186 ms, percentile(95%) = 1.34717 ms, percentile(99%) = 1.51196 ms
[10/28/2024-21:14:18] [I] H2D Latency: min = 0.0236206 ms, max = 1.71045 ms, mean = 0.0453783 ms, median = 0.0424805 ms, percentile(90%) = 0.0490723 ms, percentile(95%) = 0.0534668 ms, percentile(99%) = 0.067749 ms
[10/28/2024-21:14:18] [I] GPU Compute Time: min = 5.65784 ms, max = 10.6769 ms, mean = 5.82668 ms, median = 5.66943 ms, percentile(90%) = 5.677 ms, percentile(95%) = 7.46094 ms, percentile(99%) = 8.66394 ms
[10/28/2024-21:14:18] [I] D2H Latency: min = 0.00488281 ms, max = 0.00933838 ms, mean = 0.00689694 ms, median = 0.00695801 ms, percentile(90%) = 0.00805664 ms, percentile(95%) = 0.00830078 ms, percentile(99%) = 0.00897217 ms
[10/28/2024-21:14:18] [I] Total Host Walltime: 3.02199 s
[10/28/2024-21:14:18] [I] Total GPU Compute Time: 3.01239 s
[10/28/2024-21:14:18] [W] * GPU compute time is unstable, with coefficient of variance = 10.5757%.
[10/28/2024-21:14:18] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/28/2024-21:14:18] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/28/2024-21:14:18] [V] 
[10/28/2024-21:14:18] [V] === Explanations of the performance metrics ===
[10/28/2024-21:14:18] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[10/28/2024-21:14:18] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[10/28/2024-21:14:18] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[10/28/2024-21:14:18] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[10/28/2024-21:14:18] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[10/28/2024-21:14:18] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[10/28/2024-21:14:18] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[10/28/2024-21:14:18] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[10/28/2024-21:14:18] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v100300] # ./trtexec --loadEngine=/home/vwif/Documents/thesis/TensorRT/demo/BERT/engines/bert_base_128.engine --verbose

I hope this means the engine doesn't have any problem, but I still get the segmentation fault when I try to run the sample inference.py.

@lix19937

This is for loading the ONNX model, right? How do I run inference with an engine using trtexec?

trtexec --onnx=your_onnx_file --verbose --saveEngine=your_plan will load the ONNX model, build an engine, and then run inference with it.

I hope this means the engine doesn't have any problem, but I still get the segmentation fault when I try to run the sample inference.py.

Maybe your code has a bug. You can use trtexec to get the engine file, then use the following Python script: https://github.com/lix19937/tensorrt-insight/blob/main/tool/infer_from_engine.py @krishnarajk
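
For reference, a minimal sketch (not the linked script) of deserializing an engine and running one inference with the TensorRT 10 Python API plus pycuda, which the BERT demo already uses. It assumes a static-shape engine; the engine path comes from this thread, and the zero-filled host buffers are placeholders for real tokenized inputs.

import numpy as np
import pycuda.autoinit          # creates and activates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("engines/bert_base_128.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
stream = cuda.Stream()

host_bufs, dev_bufs = {}, {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    # For engines with dynamic input shapes, call context.set_input_shape(name, ...)
    # on every input before querying shapes here.
    shape = tuple(context.get_tensor_shape(name))
    dtype = trt.nptype(engine.get_tensor_dtype(name))
    host_bufs[name] = np.zeros(shape, dtype=dtype)   # replace with real tokenized inputs
    dev_bufs[name] = cuda.mem_alloc(host_bufs[name].nbytes)
    context.set_tensor_address(name, int(dev_bufs[name]))
    if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
        cuda.memcpy_htod_async(dev_bufs[name], host_bufs[name], stream)

context.execute_async_v3(stream_handle=stream.handle)
for name in host_bufs:
    if engine.get_tensor_mode(name) == trt.TensorIOMode.OUTPUT:
        cuda.memcpy_dtoh_async(host_bufs[name], dev_bufs[name], stream)
stream.synchronize()
print({n: b.shape for n, b in host_bufs.items()})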

@yuanyao-nv yuanyao-nv added the triaged Issue has been triaged by maintainers label Oct 31, 2024