
decoder MMHA kernel: support INT8 SCALE_Q_INSTEAD_OF_K and SCALE_P_INSTEAD_OF_V #2085

Open

lishicheng1996 wants to merge 1 commit into main

Conversation

@lishicheng1996 commented Aug 5, 2024

Following the logic of MMHA_FP8_SCALE_Q_INSTEAD_OF_K and MMHA_FP8_SCALE_P_INSTEAD_OF_V, I implemented the INT8 version.

It is mathematically equivalent to the original compute logic, with no numerical accuracy degradation.

I tested the speed on H20, A100, and 4090 GPUs. The results show that the average latency of the MMHA kernel decreased by about 5% to 8%, thanks to fewer FMUL instructions.
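For reference, here is a minimal host-side sketch of the rearrangement (the function names, scale values, and scalar loops are illustrative assumptions, not the actual vectorized TRT-LLM kernel code). The baseline dequantizes every K element with its own scale inside the QK inner loop; SCALE_Q_INSTEAD_OF_K folds the K scale into Q's scale once, before iterating over the cached timesteps, which is where the FMUL savings come from. SCALE_P_INSTEAD_OF_V applies the same folding to the softmax output P and the V cache. The two forms are algebraically identical, since (q·s_q)·(k·s_k) = (q·s_q·s_k)·k.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative sketch of SCALE_Q_INSTEAD_OF_K (hypothetical names; the real
// kernel operates on vectorized CUDA types). Baseline:
//   logit = sum_i (q[i] * q_scale) * (k[i] * k_scale)
// Folded:
//   logit = sum_i (q[i] * q_scale * k_scale) * k[i]

float qk_dot_baseline(const std::vector<int8_t>& q, const std::vector<int8_t>& k,
                      float q_scale, float k_scale) {
    float acc = 0.0f;
    for (size_t i = 0; i < q.size(); ++i)
        acc += (q[i] * q_scale) * (k[i] * k_scale);  // dequant K inside the inner loop
    return acc;
}

float qk_dot_folded(const std::vector<int8_t>& q, const std::vector<int8_t>& k,
                    float q_scale, float k_scale) {
    // Dequantize Q once with the folded scale; in the kernel this happens
    // before iterating over the K cache, so the savings grow with seq_len.
    std::vector<float> qf(q.size());
    const float folded_scale = q_scale * k_scale;
    for (size_t i = 0; i < q.size(); ++i)
        qf[i] = q[i] * folded_scale;

    float acc = 0.0f;
    for (size_t i = 0; i < q.size(); ++i)
        acc += qf[i] * static_cast<float>(k[i]);     // no per-K dequant FMUL
    return acc;
}

int main() {
    const std::vector<int8_t> q = {12, -7, 33, 5};
    const std::vector<int8_t> k = {-3, 21, 8, -19};
    printf("baseline: %f\n", qk_dot_baseline(q, k, 0.05f, 0.02f));
    printf("folded:   %f\n", qk_dot_folded(q, k, 0.05f, 0.02f));
    // Both print the same value (up to floating-point rounding).
    return 0;
}
```

The intuition: Q holds only head_size elements per decode step, while the K cache holds seq_len × head_size, so folding the K scale into Q turns a per-timestep multiply into a one-time one.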

[Screenshot: Nsight Compute Kernel Summary]

[Screenshot: Nsight Compute Instruction Statistics]

@lishicheng1996 (Author)

@byshiue @Shixiaowei02 Could you please review and verify this PR? Thank you very much!

@PerkzZheng (Collaborator)

@lishicheng1996 Moving the INT8 scale forward is not enabled because we observed an accuracy drop caused by it. Have you checked the accuracy after enabling it? I will give it another try locally.

@lishicheng1996 (Author) commented Sep 23, 2024

> @lishicheng1996 Moving the INT8 scale forward is not enabled because we observed an accuracy drop caused by it. Have you checked the accuracy after enabling it? I will give it another try locally.

Hi, thanks for your review! I checked the MMLU accuracy score on Llama3.0-8B with INT8 SmoothQuant. With the INT8 scale moved forward, I didn't see any accuracy drop.

@PerkzZheng (Collaborator)

> Hi, thanks for your review! I checked the MMLU accuracy score on Llama3.0-8B with INT8 SmoothQuant. With the INT8 scale moved forward, I didn't see any accuracy drop.

Thanks. I will run more tests on different models locally, merge this into the internal TRT-LLM, and release it in the coming weeks if everything looks good. Thanks again.

@lishicheng1996 (Author)

> Thanks. I will run more tests on different models locally, merge this into the internal TRT-LLM, and release it in the coming weeks if everything looks good. Thanks again.

Hi, may I ask how the accuracy looked in your local tests? ^_^
