GPU Compute Time is different between trtexec command line and python context.execute_async_v2 #3747

Closed
undcloud opened this issue Mar 28, 2024 · 1 comment

@undcloud commented Mar 28, 2024

Description

The GPU compute time reported for the same model differs widely between three methods:

trtexec GPU Compute Time: ~197 ms
Python context.execute_async_v2, timed with time.time(): ~1 ms
PyTorch forward pass, timed with time.time(): ~11 ms

Environment

Container: nvcr.io/nvidia/tensorrt 24.02-py3

TensorRT Version:

NVIDIA GPU: NVIDIA GeForce RTX 2080

NVIDIA Driver Version: 535.129.03

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt 24.02-py3

Commands or scripts:
Python test code:

def _do_inference_base(inputs, outputs, stream, execute_async):
    # Transfer input data to the GPU.
    cpu2gpu_time_start = time.time()
    kind = cudart.cudaMemcpyKind.cudaMemcpyHostToDevice
    [cuda_call(cudart.cudaMemcpyAsync(inp.device, inp.host, inp.nbytes, kind, stream)) for inp in inputs]
    cpu2gpu_time_end = time.time()
    print('cpu2gpu: ', cpu2gpu_time_end - cpu2gpu_time_start)    
    
    # Run inference.
    ai_time_start = time.time()            
    execute_async()
    ai_time_end = time.time()            
    print('ai_time: ', ai_time_end - ai_time_start)
    
    # Transfer predictions back from the GPU.
    gpu2cpu_time_start = time.time()                           
    kind = cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost
    [cuda_call(cudart.cudaMemcpyAsync(out.host, out.device, out.nbytes, kind, stream)) for out in outputs]
    # Synchronize the stream
    cuda_call(cudart.cudaStreamSynchronize(stream))
    gpu2cpu_time_end = time.time()                        
    print('gpu2cpu: ', gpu2cpu_time_end - gpu2cpu_time_start)    
    
    # Return only the host outputs.
    return [out.host for out in outputs]

output:

# cpu2gpu:  4.57763671875e-05
# ai_time:  0.0011186599731445312
# gpu2cpu:  0.20252299308776855
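Note that execute_async_v2 only enqueues work on the CUDA stream and returns immediately, so wrapping execute_async() in time.time() measures launch overhead rather than GPU execution; the queued compute is then waited out inside the gpu2cpu block, where cudaStreamSynchronize blocks until everything on the stream has finished. A minimal sketch of timing the same call with CUDA events instead (assuming the cuda_call helper and cudart bindings from the snippet above; the event names are illustrative):

# Sketch: event-based GPU timing around execute_async().
start = cuda_call(cudart.cudaEventCreate())
end = cuda_call(cudart.cudaEventCreate())

cuda_call(cudart.cudaEventRecord(start, stream))
execute_async()  # enqueues work; returns before the GPU finishes
cuda_call(cudart.cudaEventRecord(end, stream))
cuda_call(cudart.cudaEventSynchronize(end))  # block until the GPU passes `end`

# cudaEventElapsedTime reports milliseconds between the two recorded events
print('gpu compute (ms): ', cuda_call(cudart.cudaEventElapsedTime(start, end)))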

trtexec:

trtexec --onnx=./Demoire_1536.onnx

output:

# trtexec --onnx=./Demoire_1536.onnx
&&&& RUNNING TensorRT.trtexec [TensorRT v8603] # trtexec --onnx=./Demoire_1536.onnx
[03/28/2024-07:10:11] [I] === Model Options ===
[03/28/2024-07:10:11] [I] Format: ONNX
[03/28/2024-07:10:11] [I] Model: ./Demoire_1536.onnx
[03/28/2024-07:10:11] [I] Output:
[03/28/2024-07:10:11] [I] === Build Options ===
[03/28/2024-07:10:11] [I] Max batch: explicit batch
[03/28/2024-07:10:11] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[03/28/2024-07:10:11] [I] minTiming: 1
[03/28/2024-07:10:11] [I] avgTiming: 8
[03/28/2024-07:10:11] [I] Precision: FP32
[03/28/2024-07:10:11] [I] LayerPrecisions: 
[03/28/2024-07:10:11] [I] Layer Device Types: 
[03/28/2024-07:10:11] [I] Calibration: 
[03/28/2024-07:10:11] [I] Refit: Disabled
[03/28/2024-07:10:11] [I] Version Compatible: Disabled
[03/28/2024-07:10:11] [I] ONNX Native InstanceNorm: Disabled
[03/28/2024-07:10:11] [I] TensorRT runtime: full
[03/28/2024-07:10:11] [I] Lean DLL Path: 
[03/28/2024-07:10:11] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[03/28/2024-07:10:11] [I] Exclude Lean Runtime: Disabled
[03/28/2024-07:10:11] [I] Sparsity: Disabled
[03/28/2024-07:10:11] [I] Safe mode: Disabled
[03/28/2024-07:10:11] [I] Build DLA standalone loadable: Disabled
[03/28/2024-07:10:11] [I] Allow GPU fallback for DLA: Disabled
[03/28/2024-07:10:11] [I] DirectIO mode: Disabled
[03/28/2024-07:10:11] [I] Restricted mode: Disabled
[03/28/2024-07:10:11] [I] Skip inference: Disabled
[03/28/2024-07:10:11] [I] Save engine: 
[03/28/2024-07:10:11] [I] Load engine: 
[03/28/2024-07:10:11] [I] Profiling verbosity: 0
[03/28/2024-07:10:11] [I] Tactic sources: Using default tactic sources
[03/28/2024-07:10:11] [I] timingCacheMode: local
[03/28/2024-07:10:11] [I] timingCacheFile: 
[03/28/2024-07:10:11] [I] Heuristic: Disabled
[03/28/2024-07:10:11] [I] Preview Features: Use default preview flags.
[03/28/2024-07:10:11] [I] MaxAuxStreams: -1
[03/28/2024-07:10:11] [I] BuilderOptimizationLevel: -1
[03/28/2024-07:10:11] [I] Input(s)s format: fp32:CHW
[03/28/2024-07:10:11] [I] Output(s)s format: fp32:CHW
[03/28/2024-07:10:11] [I] Input build shapes: model
[03/28/2024-07:10:11] [I] Input calibration shapes: model
[03/28/2024-07:10:11] [I] === System Options ===
[03/28/2024-07:10:11] [I] Device: 0
[03/28/2024-07:10:11] [I] DLACore: 
[03/28/2024-07:10:11] [I] Plugins:
[03/28/2024-07:10:11] [I] setPluginsToSerialize:
[03/28/2024-07:10:11] [I] dynamicPlugins:
[03/28/2024-07:10:11] [I] ignoreParsedPluginLibs: 0
[03/28/2024-07:10:11] [I] 
[03/28/2024-07:10:11] [I] === Inference Options ===
[03/28/2024-07:10:11] [I] Batch: Explicit
[03/28/2024-07:10:11] [I] Input inference shapes: model
[03/28/2024-07:10:11] [I] Iterations: 10
[03/28/2024-07:10:11] [I] Duration: 3s (+ 200ms warm up)
[03/28/2024-07:10:11] [I] Sleep time: 0ms
[03/28/2024-07:10:11] [I] Idle time: 0ms
[03/28/2024-07:10:11] [I] Inference Streams: 1
[03/28/2024-07:10:11] [I] ExposeDMA: Disabled
[03/28/2024-07:10:11] [I] Data transfers: Enabled
[03/28/2024-07:10:11] [I] Spin-wait: Disabled
[03/28/2024-07:10:11] [I] Multithreading: Disabled
[03/28/2024-07:10:11] [I] CUDA Graph: Disabled
[03/28/2024-07:10:11] [I] Separate profiling: Disabled
[03/28/2024-07:10:11] [I] Time Deserialize: Disabled
[03/28/2024-07:10:11] [I] Time Refit: Disabled
[03/28/2024-07:10:11] [I] NVTX verbosity: 0
[03/28/2024-07:10:11] [I] Persistent Cache Ratio: 0
[03/28/2024-07:10:11] [I] Inputs:
[03/28/2024-07:10:11] [I] === Reporting Options ===
[03/28/2024-07:10:11] [I] Verbose: Disabled
[03/28/2024-07:10:11] [I] Averages: 10 inferences
[03/28/2024-07:10:11] [I] Percentiles: 90,95,99
[03/28/2024-07:10:11] [I] Dump refittable layers:Disabled
[03/28/2024-07:10:11] [I] Dump output: Disabled
[03/28/2024-07:10:11] [I] Profile: Disabled
[03/28/2024-07:10:11] [I] Export timing to JSON file: 
[03/28/2024-07:10:11] [I] Export output to JSON file: 
[03/28/2024-07:10:11] [I] Export profile to JSON file: 
[03/28/2024-07:10:11] [I] 
[03/28/2024-07:10:12] [I] === Device Information ===
[03/28/2024-07:10:12] [I] Selected Device: NVIDIA GeForce RTX 2080
[03/28/2024-07:10:12] [I] Compute Capability: 7.5
[03/28/2024-07:10:12] [I] SMs: 46
[03/28/2024-07:10:12] [I] Device Global Memory: 7974 MiB
[03/28/2024-07:10:12] [I] Shared Memory per SM: 64 KiB
[03/28/2024-07:10:12] [I] Memory Bus Width: 256 bits (ECC disabled)
[03/28/2024-07:10:12] [I] Application Compute Clock Rate: 1.71 GHz
[03/28/2024-07:10:12] [I] Application Memory Clock Rate: 7 GHz
[03/28/2024-07:10:12] [I] 
[03/28/2024-07:10:12] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[03/28/2024-07:10:12] [I] 
[03/28/2024-07:10:12] [I] TensorRT version: 8.6.3
[03/28/2024-07:10:12] [I] Loading standard plugins
[03/28/2024-07:10:12] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 19, GPU 423 (MiB)
[03/28/2024-07:10:16] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +889, GPU +174, now: CPU 984, GPU 597 (MiB)
[03/28/2024-07:10:16] [I] Start parsing network model.
[03/28/2024-07:10:16] [I] [TRT] ----------------------------------------------------------------
[03/28/2024-07:10:16] [I] [TRT] Input filename:   ./Demoire_1536.onnx
[03/28/2024-07:10:16] [I] [TRT] ONNX IR version:  0.0.7
[03/28/2024-07:10:16] [I] [TRT] Opset version:    14
[03/28/2024-07:10:16] [I] [TRT] Producer name:    pytorch
[03/28/2024-07:10:16] [I] [TRT] Producer version: 2.0.0
[03/28/2024-07:10:16] [I] [TRT] Domain:           
[03/28/2024-07:10:16] [I] [TRT] Model version:    0
[03/28/2024-07:10:16] [I] [TRT] Doc string:       
[03/28/2024-07:10:16] [I] [TRT] ----------------------------------------------------------------
[03/28/2024-07:10:16] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[03/28/2024-07:10:16] [I] Finished parsing network model. Parse time: 0.0539368
[03/28/2024-07:10:16] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[03/28/2024-07:10:17] [I] [TRT] Graph optimization time: 0.289515 seconds.
[03/28/2024-07:10:17] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[03/28/2024-07:10:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[03/28/2024-07:16:04] [I] [TRT] [GraphReduction] The approximate region cut reduction algorithm is called.
[03/28/2024-07:16:04] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[03/28/2024-07:16:05] [I] [TRT] Total Host Persistent Memory: 676736
[03/28/2024-07:16:05] [I] [TRT] Total Device Persistent Memory: 72025600
[03/28/2024-07:16:05] [I] [TRT] Total Scratch Memory: 0
[03/28/2024-07:16:05] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 2884 MiB
[03/28/2024-07:16:05] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 280 steps to complete.
[03/28/2024-07:16:05] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 53.8601ms to assign 13 blocks to 280 nodes requiring 1536787968 bytes.
[03/28/2024-07:16:05] [I] [TRT] Total Activation Memory: 1536786432
[03/28/2024-07:16:05] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +98, now: CPU 0, GPU 98 (MiB)
[03/28/2024-07:16:05] [I] Engine built in 353.318 sec.
[03/28/2024-07:16:05] [I] [TRT] Loaded engine size: 34 MiB
[03/28/2024-07:16:05] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +98, now: CPU 0, GPU 98 (MiB)
[03/28/2024-07:16:05] [I] Engine deserialized in 0.0211241 sec.
[03/28/2024-07:16:05] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1534, now: CPU 0, GPU 1632 (MiB)
[03/28/2024-07:16:05] [I] Setting persistentCacheLimit to 0 bytes.
[03/28/2024-07:16:05] [I] Using random values for input onnx::Reshape_0
[03/28/2024-07:16:05] [I] Input binding for onnx::Reshape_0 with dimensions 1x3x1536x1536 is created.
[03/28/2024-07:16:05] [I] Output binding for 781 with dimensions 1x3x384x384 is created.
[03/28/2024-07:16:05] [I] Output binding for 903 with dimensions 1x3x768x768 is created.
[03/28/2024-07:16:05] [I] Output binding for 1025 with dimensions 1x3x1536x1536 is created.
[03/28/2024-07:16:05] [I] Starting inference
[03/28/2024-07:16:09] [I] Warmup completed 1 queries over 200 ms
[03/28/2024-07:16:09] [I] Timing trace has 18 queries over 3.75475 s
[03/28/2024-07:16:09] [I] 
[03/28/2024-07:16:09] [I] === Trace details ===
[03/28/2024-07:16:09] [I] Trace averages of 10 runs:
[03/28/2024-07:16:09] [I] Average on 10 runs - GPU latency: 197.328 ms - Host latency: 202.374 ms (enqueue 3.8028 ms)
[03/28/2024-07:16:09] [I] 
[03/28/2024-07:16:09] [I] === Performance summary ===
[03/28/2024-07:16:09] [I] Throughput: 4.79393 qps
[03/28/2024-07:16:09] [I] Latency: min = 201.983 ms, max = 202.73 ms, mean = 202.333 ms, median = 202.289 ms, percentile(90%) = 202.688 ms, percentile(95%) = 202.73 ms, percentile(99%) = 202.73 ms
[03/28/2024-07:16:09] [I] Enqueue Time: min = 0.779495 ms, max = 5.79248 ms, mean = 3.94108 ms, median = 4.29803 ms, percentile(90%) = 4.83594 ms, percentile(95%) = 5.79248 ms, percentile(99%) = 5.79248 ms
[03/28/2024-07:16:09] [I] H2D Latency: min = 2.1731 ms, max = 2.22754 ms, mean = 2.19205 ms, median = 2.1871 ms, percentile(90%) = 2.21805 ms, percentile(95%) = 2.22754 ms, percentile(99%) = 2.22754 ms
[03/28/2024-07:16:09] [I] GPU Compute Time: min = 196.907 ms, max = 197.688 ms, mean = 197.285 ms, median = 197.253 ms, percentile(90%) = 197.626 ms, percentile(95%) = 197.688 ms, percentile(99%) = 197.688 ms
[03/28/2024-07:16:09] [I] D2H Latency: min = 2.83167 ms, max = 2.89685 ms, mean = 2.8566 ms, median = 2.85107 ms, percentile(90%) = 2.87549 ms, percentile(95%) = 2.89685 ms, percentile(99%) = 2.89685 ms
[03/28/2024-07:16:09] [I] Total Host Walltime: 3.75475 s
[03/28/2024-07:16:09] [I] Total GPU Compute Time: 3.55112 s
[03/28/2024-07:16:09] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/28/2024-07:16:09] [I]

PyTorch test code:

with torch.no_grad():
    cpu2gpu_time_start = time.time()
    image = image.to('cuda')
    cpu2gpu_time_end = time.time()
    print('cpu2gpu: ', cpu2gpu_time_end - cpu2gpu_time_start)

    ai_time_start = time.time()
    pred, _, _ = model_(image)
    ai_time_end = time.time()
    print('ai_time: ', ai_time_end - ai_time_start)

    gpu2cpu_time_start = time.time()
    pred = pred.float().to('cpu').numpy()
    gpu2cpu_time_end = time.time()
    print('gpu2cpu: ', gpu2cpu_time_end - gpu2cpu_time_start)

output:

cpu2gpu:  0.002646207809448242
ai_time:  0.01140141487121582
gpu2cpu:  0.28691959381103516
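The same caveat applies here: CUDA operations in PyTorch are asynchronous, so ai_time mostly measures kernel launch, and the pending work is paid for in the .to('cpu') call. A synchronized variant of the measurement (a sketch reusing model_ and image from the snippet above; torch.cuda.synchronize is the standard PyTorch call for this):

import time
import torch

with torch.no_grad():
    image = image.to('cuda')
    torch.cuda.synchronize()  # make sure the H2D copy has finished

    ai_time_start = time.time()
    pred, _, _ = model_(image)
    torch.cuda.synchronize()  # wait for the forward pass to complete on the GPU
    ai_time_end = time.time()
    print('ai_time (synced): ', ai_time_end - ai_time_start)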

Have you tried the latest release?: Yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): No
polygraphy run Demoire_1536.onnx --onnxrt
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
[I] RUNNING | Command: /usr/local/bin/polygraphy run Demoire_1536.onnx --onnxrt
[I] onnxrt-runner-N0-03/28/24-07:34:06 | Activating and starting inference
[!] Module: 'onnxruntime' is required but could not be imported.
Note: Error was: No module named 'onnxruntime'
You can set POLYGRAPHY_AUTOINSTALL_DEPS=1 in your environment variables to allow Polygraphy to automatically install missing modules.
[E] FAILED | Runtime: 0.024s | Command: /usr/local/bin/polygraphy run Demoire_1536.onnx --onnxrt

@lix19937 commented Mar 30, 2024

If you use trtexec to get performance metrics, use:

/usr/src/tensorrt/bin/trtexec --onnx=${onnx} --useCudaGraph --workspace=10240 \
--verbose  --separateProfileRun 
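For context (an editorial note, not part of the original comment): --useCudaGraph captures the enqueued work into a CUDA graph, which removes most per-launch overhead, and --separateProfileRun performs per-layer profiling in a separate pass so it does not distort the timing run. On recent TensorRT releases --workspace is deprecated in favor of --memPoolSize, so a roughly equivalent invocation (an assumption based on current trtexec options) would be:

/usr/src/tensorrt/bin/trtexec --onnx=${onnx} --useCudaGraph \
    --memPoolSize=workspace:10240 --verbose --separateProfileRun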
