-
In the documentation for Memory-efficient attention (link to relevant documentation) there is an "equivalent pytorch code" part. In that code, to get the raw attention scores, we multiply a tensor Q of shape [Batch, SeqLen, NumHeads, Dimension] with the transposed K of the same shape. As an (intermediate) result, we get a tensor of shape [Batch, SeqLen, NumHeads, NumHeads], over the last dimension of which we then take a softmax before multiplying the result by V. I am a bit confused about how this corresponds to the traditional multi-head attention calculation, since it seems that we are essentially attending across attention heads here rather than between sequence elements within one head. I would appreciate it if somebody could help me understand this part.
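For concreteness, here is a minimal sketch of what I mean (the `query @ key.transpose(-2, -1)` step and the shapes are my reading of the docs code, not copied from it):

```python
import torch

B, S, H, D = 2, 16, 8, 64
q = torch.randn(B, S, H, D)
k = torch.randn(B, S, H, D)
v = torch.randn(B, S, H, D)

# Matmul over the last two dims of 4D tensors:
# [B, S, H, D] @ [B, S, D, H] -> [B, S, H, H], i.e. scores between heads?
attn = (q * D**-0.5) @ k.transpose(-2, -1)
attn = attn.softmax(-1)
out = attn @ v  # [B, S, H, H] @ [B, S, H, D] -> [B, S, H, D]
print(attn.shape, out.shape)
```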
-
Hi,
Thanks for pointing that out! Historically, the inputs had no head dimension when this doc was written, so the code was equivalent back then.
But you are right: with 4D inputs, the code is not equivalent. We need to fix this.
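Something along these lines would be an equivalent reference for 4D [Batch, SeqLen, NumHeads, Dimension] inputs (a rough sketch, not the final doc code):

```python
import torch

def reference_attention_4d(q, k, v):
    # Hypothetical reference for q, k, v of shape [B, S, H, D].
    scale = q.shape[-1] ** -0.5
    # Scores over sequence positions, per head: [B, H, S_q, S_k]
    attn = torch.einsum("bqhd,bkhd->bhqk", q * scale, k)
    attn = attn.softmax(dim=-1)
    # Weighted sum over keys, back to [B, S, H, D]
    return torch.einsum("bhqk,bkhd->bqhd", attn, v)
```

The important difference is that the softmax runs over the key sequence dimension within each head, not over the heads.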