On onnxruntime-gpu with CUDAExecutionProvider: "Some nodes were not assigned to the preferred execution providers which may or may not have a negative impact on performance." #20309
surajrao2003 asked this question in EP Q&A
I am trying to run inference with a dynamically quantized yolov8s ONNX model on GPU.
I took yolov8s.pt and exported it to yolov8.onnx using ONNX export. Then I quantized the ONNX model using the dynamic quantization (uint8) method provided by onnxruntime, which reduced the model size by around 4 times. Although the quantized model works fine when running inference on CPU (CPUExecutionProvider), it gives a low fps (frames per second) when running inference on GPU with CUDAExecutionProvider.
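For reference, here is roughly what I did (a sketch; the file names are illustrative and the export call uses the Ultralytics API):

```python
from ultralytics import YOLO
from onnxruntime.quantization import quantize_dynamic, QuantType

# Export the PyTorch checkpoint to ONNX (Ultralytics export, default settings)
model = YOLO("yolov8s.pt")
model.export(format="onnx")  # writes the ONNX model next to the .pt file

# Dynamically quantize the weights to uint8 with onnxruntime
quantize_dynamic(
    model_input="yolov8.onnx",
    model_output="yolov8_quant.onnx",
    weight_type=QuantType.QUInt8,
)
```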
Mentioning both providers, CUDAExecutionProvider and CPUExecutionProvider, does make the warnings disappear, but my concern is why some nodes are being forced to execute on the CPU and not on CUDA. Is it because CUDA does not yet provide support for such quantized nodes, or is there some other particular reason for it?
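This is roughly how I create the session when listing both providers (a sketch; the model path is the quantized file from above):

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "yolov8_quant.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers the session actually uses
```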
When I checked with a lower log severity level, I could see which nodes were being assigned to CPUExecutionProvider instead of CUDAExecutionProvider.
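Roughly what I used to surface that information (a sketch; log_severity_level=0 enables verbose logging, which prints node placements per execution provider during session creation):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE; node placement is logged when the session is created

session = ort.InferenceSession(
    "yolov8_quant.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```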
I am curious why CUDA is not able to handle these nodes, because in my opinion it is very unusual for GPU inference to give a lower fps than CPU inference.
I believe the issue is due to limited ONNX Runtime support for quantized (int/uint) operations on CUDA. If this is true, is there any ongoing work to add support for these quantized operations, which should further improve GPU performance?
Is there any alternative way to run these quantized operations exclusively on CUDA?