
[BUG] Flash attention cannot be applied by passing the --use-flash-attn flag when the --use-mcore-models flag is also passed #1259

Open
efsotr opened this issue Oct 26, 2024 · 1 comment

Comments


efsotr commented Oct 26, 2024

Passing the --use-flash-attn flag is intended to enable FlashAttention; however, when the --use-mcore-models flag (which uses Transformer Engine) is also specified, FlashAttention is not applied.


yaox12 commented Nov 4, 2024

TransformerEngine has its own logic for selecting attention backends. Generally speaking, if all of the required conditions are satisfied (q/k/v shape and layout, attention bias, sliding window, and so on), TE prefers FlashAttention v2 on pre-Hopper GPUs (SM < 90) and cuDNN FusedAttention on Hopper and later GPUs (SM >= 90).
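For reference, here is a minimal sketch (plain PyTorch, not a TE API) that checks which side of the SM 90 boundary your GPU falls on. The backend TE actually picks still depends on all the other conditions above.

```python
import torch

# Rule-of-thumb check mirroring the comment above: FlashAttention v2 on
# pre-Hopper GPUs (SM < 90), cuDNN FusedAttention on Hopper and later
# (SM >= 90). TE's real selection also considers q/k/v shape/layout,
# attention bias, sliding window, dtype, and other conditions.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor
    if sm >= 90:
        print(f"SM {sm}: TE typically prefers cuDNN FusedAttention")
    else:
        print(f"SM {sm}: TE typically prefers FlashAttention v2")
```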

If you want to know why a certain attention backend is selected or disabled, you can set the following environment variables to enable TE's logging:

NVTE_DEBUG=1
NVTE_DEBUG_LEVEL=2
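
For example, a minimal sketch assuming you launch training from a Python entry point (equivalently, you can export these variables in your shell before running the Megatron-LM launch script):

```python
import os

# Enable TE's attention-backend selection logging. Setting these before
# transformer_engine is imported is the safe option; the debug messages
# then explain which backend is chosen and why others are disabled.
os.environ["NVTE_DEBUG"] = "1"        # turn on TE debug logging
os.environ["NVTE_DEBUG_LEVEL"] = "2"  # level 2 prints backend-selection details

import transformer_engine.pytorch as te  # noqa: E402
```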
