
FP8 inference performance question, possible W8A8 and 8-bit flash/sage attention acceleration using H100? #146

Open
TheTinyTeddy opened this issue Dec 20, 2024 · 0 comments

TheTinyTeddy commented Dec 20, 2024

Hi,

Thank you for the amazing work!

I observed that with bf16 the model runs at about 34.2 s/it, but with fp8 the iteration time rises to about 34.4 s/it, i.e. slightly slower than the bf16 DiT model. The GPU is an H100, which has FP8 Tensor Core support. So I was wondering whether W8A8 in the linear layers and fp8 flash/sage attention acceleration would be possible for HunyuanVideo? A rough sketch of what I mean by W8A8 is below.
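
This is just an illustration of the W8A8 idea with dynamic per-tensor activation scales, assuming PyTorch >= 2.4 on Hopper where `torch._scaled_mm` returns a single tensor and dispatches to the FP8 Tensor Cores; the class and helper names are mine, not anything in the HunyuanVideo code.

```python
# Rough W8A8 sketch (fp8 weights + fp8 activations), per-tensor scales only.
# Assumes PyTorch >= 2.4 on an H100; matmul K/N dims should be multiples of 16
# for the fp8 GEMM. Not the HunyuanVideo implementation, just an illustration.
import torch

FP8 = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8).max

def quantize_fp8(x: torch.Tensor):
    """Per-tensor symmetric quantization to float8_e4m3fn."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(FP8), scale.to(torch.float32)

class FP8Linear(torch.nn.Module):
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        # Quantize the bf16 weight once, ahead of time.
        w8, w_scale = quantize_fp8(linear.weight.data)
        # _scaled_mm wants the second operand column-major, which
        # weight.t() already is for a standard nn.Linear weight.
        self.register_buffer("w8_t", w8.t())
        self.register_buffer("w_scale", w_scale)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_shape = x.shape
        x2d = x.reshape(-1, orig_shape[-1]).contiguous()
        x8, x_scale = quantize_fp8(x2d)  # dynamic activation quantization
        out = torch._scaled_mm(
            x8, self.w8_t,
            scale_a=x_scale, scale_b=self.w_scale,
            out_dtype=torch.bfloat16,
        )
        if self.bias is not None:
            out = out + self.bias
        return out.reshape(*orig_shape[:-1], -1)
```

For the attention side, something like SageAttention's `sageattn(q, k, v)` or FlashAttention 3's FP8 path could in principle be dropped in place of `scaled_dot_product_attention`, though I haven't verified whether the output quality holds up for video generation.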

Many thanks
