
How to quantize Linear/LN/ReLU-like structures with int8. #4242

Open
WeixiangXu opened this issue Nov 8, 2024 · 5 comments
Labels: Embedded (issues when using TensorRT on embedded platforms) · Performance (general performance issues) · triaged (issue has been triaged by maintainers)

Comments

@WeixiangXu

My TensorRT version is 8.6.10 on Orin.

My model has a Linear/LN/ReLU-like structure, as shown below:
[image: model structure]

I add Q/DQ nodes before the MatMul node to run it in INT8, as shown below:
[image: ONNX graph with Q/DQ nodes inserted before the MatMul]
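
(A minimal sketch of one common way to insert such Q/DQ nodes, using NVIDIA's pytorch-quantization toolkit; the `Block` module and its dimensions are hypothetical stand-ins, not the issue's actual model:)

```python
import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_modules.initialize()  # monkey-patch nn.Linear -> quant_nn.QuantLinear (Q/DQ on input + weight)

class Block(nn.Module):  # hypothetical stand-in for the Linear/LN/ReLU structure
    def __init__(self, dim=1024):
        super().__init__()
        self.fc = nn.Linear(dim, dim)  # replaced by QuantLinear at construction time
        self.ln = nn.LayerNorm(dim)    # left unquantized
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.ln(self.fc(x)))

model = Block().eval()
# ... run calibration data through the model to set the quantizer ranges (amax) before export ...

quant_nn.TensorQuantizer.use_fb_fake_quant = True  # emit QuantizeLinear/DequantizeLinear ONNX ops
torch.onnx.export(model, torch.randn(8, 1024), "block_qdq.onnx",
                  opset_version=13)  # LayerNorm is decomposed into multiple ops at this opset
```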

However, INT8 is slower than FP16.

The INT8 engine graph is shown below.
[image: INT8 engine graph]

What is the best practice for quantizing Linear/LN/ReLU-like structures? They account for about 50% of the latency in my model.

@lix19937

lix19937 commented Nov 9, 2024

You can export the ONNX with opset=17, which makes LayerNorm a single node.
On the other hand, running LayerNorm in int8 will usually degrade the model's accuracy significantly.
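
(A minimal sketch of that export, using a toy LayerNorm module rather than the issue's model:)

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(1024).eval()
torch.onnx.export(
    ln,
    torch.randn(8, 1024),
    "ln_opset17.onnx",
    opset_version=17,  # opset >= 17 exports nn.LayerNorm as a single LayerNormalization node
    input_names=["x"],
    output_names=["y"],
)
```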

@lix19937

lix19937 commented Nov 9, 2024

Also, you can refer to TensorRT-LLM or FasterTransformer to implement a custom layer.

@WeixiangXu
Author

> You can export the ONNX with opset=17, which makes LayerNorm a single node. On the other hand, running LayerNorm in int8 will usually degrade the model's accuracy significantly.

@lix19937 Thanks for your reply!

I upgraded the opset to 17.
[image: exported graph with LayerNorm as a single node]

However, int8 with Q/DQ nodes is still slower than fp16 (int8: 7.5 ms vs. fp16: 6 ms).

@WeixiangXu
Author

@ttyio @zerollzeng Could you please share any thoughts you might have?

@lix19937

> However, int8 with Q/DQ nodes is still slower than fp16 (int8: 7.5 ms vs. fp16: 6 ms).

You can try testing an ONNX that includes only transpose + matmul + ln + add + relu, then compare the latency; see the sketch below.
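
(A sketch of that isolation test; the shapes and names are made up, and the idea is to export just this subgraph and time it in both precisions with trtexec:)

```python
import torch
import torch.nn as nn

class Sub(nn.Module):  # hypothetical: transpose + matmul + ln + add + relu only
    def __init__(self, dim=1024):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim, dim))
        self.ln = nn.LayerNorm(dim)

    def forward(self, x, residual):
        x = x.transpose(1, 2)            # transpose
        x = torch.matmul(x, self.w)      # matmul
        x = self.ln(x)                   # ln (single node at opset 17)
        return torch.relu(x + residual)  # add + relu

m = Sub().eval()
torch.onnx.export(m, (torch.randn(2, 1024, 64), torch.randn(2, 64, 1024)),
                  "sub.onnx", opset_version=17)
# Then compare the two precisions, e.g.:
#   trtexec --onnx=sub.onnx --fp16
#   trtexec --onnx=sub_qdq.onnx --int8 --fp16   (the Q/DQ variant of the same graph)
```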

@poweiw added the Performance, triaged, and Embedded labels on Nov 18, 2024