decoder MMHA kernel support INT8 SCALE_Q_INSTEAD_OF_K and SCALE_P_INS… #2085

lishicheng1996 · 2024-08-05T08:09:44Z

Following the logic of MMHA_FP8_SCALE_Q_INSTEAD_OF_K and MMHA_FP8_SCALE_P_INSTEAD_OF_V, I implemented the INT8 version.

It is theoretically equivalent to the original compute logic without any numeric accuracy degradation.

I tested the speed on H20, A100, and 4090 GPUs. The results show that the average latency of the MMHA kernel decreased by about 5% to 8%, thanks to fewer FMUL instructions.

[Image: Nsight Compute Kernel Summary]

[Image: Nsight Compute Instruction Statistics]
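To illustrate why moving the scale is mathematically equivalent and why it removes multiplications, here is a minimal Python sketch. It is not the kernel code from this PR. It assumes a single per-tensor dequantization scale for the INT8 K cache and for the INT8 V cache (`k_scale` and `v_scale` are hypothetical names), handles one head and one decoding step, and omits the usual 1/sqrt(d) logit scaling for brevity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scale_kv(q, k_int8, v_int8, k_scale, v_scale):
    """Baseline: dequantize every K and V element before using it."""
    scale_muls = 0
    logits = []
    for k_row in k_int8:
        acc = 0.0
        for qi, ki in zip(q, k_row):
            acc += qi * (ki * k_scale)  # one scale multiply per K element
            scale_muls += 1
        logits.append(acc)
    p = softmax(logits)
    out = [0.0] * len(v_int8[0])
    for pt, v_row in zip(p, v_int8):
        for d, vd in enumerate(v_row):
            out[d] += pt * (vd * v_scale)  # one scale multiply per V element
            scale_muls += 1
    return out, scale_muls


def attention_scale_qp(q, k_int8, v_int8, k_scale, v_scale):
    """Reordered: fold k_scale into Q once and v_scale into P once."""
    scale_muls = 0
    q_scaled = [qi * k_scale for qi in q]  # D scale multiplies in total
    scale_muls += len(q)
    logits = [sum(qs * ki for qs, ki in zip(q_scaled, k_row)) for k_row in k_int8]
    p = softmax(logits)
    p_scaled = [pt * v_scale for pt in p]  # T scale multiplies in total
    scale_muls += len(p)
    out = [0.0] * len(v_int8[0])
    for ps, v_row in zip(p_scaled, v_int8):
        for d, vd in enumerate(v_row):
            out[d] += ps * vd
    return out, scale_muls


if __name__ == "__main__":
    random.seed(0)
    T, D = 16, 8  # cached tokens, head size (illustrative values)
    q = [random.uniform(-1.0, 1.0) for _ in range(D)]
    k = [[random.randint(-127, 127) for _ in range(D)] for _ in range(T)]
    v = [[random.randint(-127, 127) for _ in range(D)] for _ in range(T)]
    out_kv, muls_kv = attention_scale_kv(q, k, v, 0.01, 0.02)
    out_qp, muls_qp = attention_scale_qp(q, k, v, 0.01, 0.02)
    print("scale multiplies:", muls_kv, "vs", muls_qp)
    print("max abs diff:", max(abs(a - b) for a, b in zip(out_kv, out_qp)))
```

With T cached tokens and head size D, the baseline in this sketch spends 2·T·D multiplications on scales, while the reordered version spends D + T. The two outputs agree up to floating-point rounding. How much of that saving shows up in the real kernel depends on details not modeled here.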

…TEAD_OF_V

lishicheng1996 · 2024-08-13T08:44:56Z

@byshiue @Shixiaowei02 Could you please review and verify this PR? Thank you very much!

PerkzZheng · 2024-09-23T01:36:45Z

@lishicheng1996 Moving the INT8 scale forward is currently not enabled because we observed an accuracy drop with it. Have you checked the accuracy after enabling it? I will give it another try locally.

lishicheng1996 · 2024-09-23T04:33:33Z

Hi, thanks for your review. I checked the MMLU accuracy score on Llama3.0-8B with INT8 SmoothQuant. With the INT8 scale moved forward, I didn't see an accuracy drop.

PerkzZheng · 2024-09-23T04:42:34Z

Thanks. I will run more tests on different models locally, merge this into the internal TRT-LLM, and release it in the coming weeks if everything looks good. Thanks again.

lishicheng1996 · 2024-11-04T07:09:14Z

Hi, may I ask how the accuracy looked in your local tests? ^_^

decoder MMHA kernel support INT8 SCALE_Q_INSTEAD_OF_K and SCALE_P_INS…

c7400b6

…TEAD_OF_V

Shixiaowei02 requested review from Shixiaowei02 and byshiue August 5, 2024 08:53
