Passing the --use-flash-attn flag is intended to enable FlashAttention; however, when the --use-mcore-models flag (which routes attention through TransformerEngine) is also specified, FlashAttention is not applied.
TransformerEngine has its own logic for selecting an attention backend. Generally speaking, if all the conditions are met (q/k/v shape and layout, attention bias, sliding window, and so on), TE prefers FlashAttention v2 on pre-Hopper GPUs (SM < 90) and cuDNN FusedAttention on Hopper and later GPUs (SM >= 90).
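For reference, a quick way to see which side of that SM 90 boundary your GPU falls on is to query its compute capability with PyTorch. This is only a minimal sketch; the comments restate the preference heuristic described above, which is TE's default behavior rather than a guarantee:

```python
import torch

# Compute capability as (major, minor), e.g. (8, 0) for A100, (9, 0) for H100.
assert torch.cuda.is_available(), "needs a CUDA-capable GPU"
major, minor = torch.cuda.get_device_capability()
sm = major * 10 + minor

if sm >= 90:
    # Hopper or later: TE generally prefers cuDNN FusedAttention here.
    print(f"SM {sm}: expect cuDNN FusedAttention to be selected by default")
else:
    # Pre-Hopper: TE generally prefers FlashAttention v2 here.
    print(f"SM {sm}: expect FlashAttention v2 to be selected by default")
```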
If you want to know why a certain attention backend is selected or disabled, you can set the following env vars to enable TE's logging.
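The specific variables did not survive here, but as an assumption based on TransformerEngine's documented debug switches, NVTE_DEBUG and NVTE_DEBUG_LEVEL enable this logging, and the NVTE_*_ATTN variables can force or disable individual backends. A hedged sketch (verify the names against your TE version's documentation):

```python
import os

# Assumed TE debug switches: print which attention backend is selected and why.
os.environ["NVTE_DEBUG"] = "1"        # enable TE debug logging
os.environ["NVTE_DEBUG_LEVEL"] = "2"  # more verbose, includes backend-selection details

# Optional while debugging: allow or disable individual backends (also assumed names).
os.environ["NVTE_FLASH_ATTN"] = "1"   # allow FlashAttention
os.environ["NVTE_FUSED_ATTN"] = "0"   # disable cuDNN FusedAttention

# These must be set before TE's attention modules are imported/used,
# e.g. at the top of the Megatron-LM pretrain script.
import transformer_engine.pytorch  # noqa: F401
```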