
[BUG] Flash attention cannot be applied by passing the --use-flash-attn flag when the --use-mcore-models flag is also passed #1259

Open
efsotr opened this issue Oct 26, 2024 · 1 comment

Comments


efsotr commented Oct 26, 2024

Passing the --use-flash-attn flag is intended to enable FlashAttention; however, when the --use-mcore-models flag (which uses Transformer Engine) is also specified, FlashAttention is not applied.


yaox12 commented Nov 4, 2024

TransformerEngine has its own logic for selecting attention backends. Generally speaking, if all of the required conditions are satisfied (q/k/v shape and layout, attention bias, sliding window, and so on), TE prefers FlashAttention v2 on pre-Hopper GPUs (SM < 90) and cuDNN FusedAttention on Hopper and later GPUs (SM >= 90).
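For reference, here is a minimal sketch (plain PyTorch, not a TE API) that checks which side of the SM 90 boundary your GPU falls on. The backend TE actually picks still depends on all the other conditions above.

```python
import torch

# Rule-of-thumb check mirroring the comment above: FlashAttention v2 on
# pre-Hopper GPUs (SM < 90), cuDNN FusedAttention on Hopper and later
# (SM >= 90). TE's real selection also considers q/k/v shape/layout,
# attention bias, sliding window, dtype, and other conditions.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor
    if sm >= 90:
        print(f"SM {sm}: TE typically prefers cuDNN FusedAttention")
    else:
        print(f"SM {sm}: TE typically prefers FlashAttention v2")
```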

If you want to know why a certain attention backend is selected or disabled, you can set the following environment variables to enable TE's logging:

NVTE_DEBUG=1
NVTE_DEBUG_LEVEL=2
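
For example, a minimal sketch assuming you launch training from a Python entry point (equivalently, you can export these variables in your shell before running the Megatron-LM launch script):

```python
import os

# Enable TE's attention-backend selection logging. Setting these before
# transformer_engine is imported is the safe option; the debug messages
# then explain which backend is chosen and why others are disabled.
os.environ["NVTE_DEBUG"] = "1"        # turn on TE debug logging
os.environ["NVTE_DEBUG_LEVEL"] = "2"  # level 2 prints backend-selection details

import transformer_engine.pytorch as te  # noqa: E402
```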
