
FP8 inference performance question, possible W8A8 and 8-bit flash/sage attention acceleration using H100? #146

Open
TheTinyTeddy opened this issue Dec 20, 2024 · 0 comments

TheTinyTeddy commented Dec 20, 2024

Hi,

Thank you for the amazing work!

I observed that with bf16 the model runs at about 34.2 s/it, but with fp8 the iteration time rises to about 34.4 s/it, i.e. slightly slower than the bf16 DiT model. The GPU is an H100, which has FP8 Tensor Core support. So I was wondering whether W8A8 in the linear layers and fp8 flash/sage attention acceleration would be possible for HunyuanVideo? A rough sketch of what I mean by W8A8 is below.
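
This is just an illustration of the W8A8 idea with dynamic per-tensor activation scales, assuming PyTorch >= 2.4 on Hopper where `torch._scaled_mm` returns a single tensor and dispatches to the FP8 Tensor Cores; the class and helper names are mine, not anything in the HunyuanVideo code.

```python
# Rough W8A8 sketch (fp8 weights + fp8 activations), per-tensor scales only.
# Assumes PyTorch >= 2.4 on an H100; matmul K/N dims should be multiples of 16
# for the fp8 GEMM. Not the HunyuanVideo implementation, just an illustration.
import torch

FP8 = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8).max

def quantize_fp8(x: torch.Tensor):
    """Per-tensor symmetric quantization to float8_e4m3fn."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(FP8), scale.to(torch.float32)

class FP8Linear(torch.nn.Module):
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        # Quantize the bf16 weight once, ahead of time.
        w8, w_scale = quantize_fp8(linear.weight.data)
        # _scaled_mm wants the second operand column-major, which
        # weight.t() already is for a standard nn.Linear weight.
        self.register_buffer("w8_t", w8.t())
        self.register_buffer("w_scale", w_scale)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_shape = x.shape
        x2d = x.reshape(-1, orig_shape[-1]).contiguous()
        x8, x_scale = quantize_fp8(x2d)  # dynamic activation quantization
        out = torch._scaled_mm(
            x8, self.w8_t,
            scale_a=x_scale, scale_b=self.w_scale,
            out_dtype=torch.bfloat16,
        )
        if self.bias is not None:
            out = out + self.bias
        return out.reshape(*orig_shape[:-1], -1)
```

For the attention side, something like SageAttention's `sageattn(q, k, v)` or FlashAttention 3's FP8 path could in principle be dropped in place of `scaled_dot_product_attention`, though I haven't verified whether the output quality holds up for video generation.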

Many thanks
