Hi,
Thank you for the amazing work!
I observed an iteration time of 34.2 s/it with bf16, but it goes up to 34.4 s/it with fp8, so fp8 is slightly slower than the bf16 DiT model. The GPU is an H100, which has FP8 Tensor Core support. So I was wondering about the possibility of W8A8 in the linear layers, plus fp8 flash/sage attention acceleration, for HunyuanVideo?
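For concreteness, here is a minimal sketch (the function name and per-tensor scaling scheme are my own assumptions, not HunyuanVideo code) of what W8A8 could look like for one linear layer, using PyTorch's private `torch._scaled_mm` on Hopper. If fp8 is only used to store the weights and the matmul still runs in bf16, memory drops but compute time doesn't, which would explain the 34.4 s/it; the speedup should only appear once the matmul itself executes in fp8:

```python
import torch

def fp8_w8a8_linear(x: torch.Tensor, weight: torch.Tensor,
                    bias: torch.Tensor | None = None) -> torch.Tensor:
    # Hypothetical W8A8 linear: quantize both activations and weights to
    # float8_e4m3fn with per-tensor dynamic scales, then run the matmul on
    # the FP8 Tensor Cores via torch._scaled_mm (a private PyTorch API;
    # keyword names here follow PyTorch >= 2.4 on an H100/Hopper GPU).
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Dequantization scales, kept as 0-dim float32 tensors as _scaled_mm expects.
    x_scale = (x.abs().max() / finfo.max).clamp(min=1e-12).float()
    w_scale = (weight.abs().max() / finfo.max).clamp(min=1e-12).float()
    x_fp8 = (x / x_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    w_fp8 = (weight / w_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    # _scaled_mm wants mat2 column-major and inner dims that are multiples of
    # 16; weight is [out, in] row-major, so weight.t() is column-major [in, out].
    out = torch._scaled_mm(
        x_fp8, w_fp8.t(),
        scale_a=x_scale, scale_b=w_scale,
        out_dtype=torch.bfloat16,
    )
    return out if bias is None else out + bias
```

On the attention side, my understanding is that the SageAttention repo advertises its `sageattn` kernel as a plug-and-play replacement for `torch.nn.functional.scaled_dot_product_attention`, so that might cover the attention part without custom kernels (I haven't verified this on HunyuanVideo myself).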
Many thanks