-
In the documentation for Memory-efficient attention (link to relevant documentation) there is an "equivalent pytorch code" part. In that code, to get the raw attention scores, we multiply a tensor Q of shape [Batch, SeqLen, NumHeads, Dimension] with the transposed K of the same shape. As an (intermediate) result, we get a tensor of shape [Batch, SeqLen, NumHeads, NumHeads], over the last dimension of which we then take a softmax before multiplying the result by V. I am a bit confused about how this corresponds to the traditional multi-head attention calculation, since it seems that we are essentially attending across attention heads here rather than between sequence elements within one head. I would appreciate it if somebody could help me understand this part.
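For concreteness, here is a minimal sketch of what I mean (the `query @ key.transpose(-2, -1)` step and the shapes are my reading of the docs code, not copied from it):

```python
import torch

B, S, H, D = 2, 16, 8, 64
q = torch.randn(B, S, H, D)
k = torch.randn(B, S, H, D)
v = torch.randn(B, S, H, D)

# Matmul over the last two dims of 4D tensors:
# [B, S, H, D] @ [B, S, D, H] -> [B, S, H, H], i.e. scores between heads?
attn = (q * D**-0.5) @ k.transpose(-2, -1)
attn = attn.softmax(-1)
out = attn @ v  # [B, S, H, H] @ [B, S, H, D] -> [B, S, H, D]
print(attn.shape, out.shape)
```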
-
Hi,
Thanks for pointing that out! Historically, the inputs had no head dimension when this doc was written, so the code was equivalent back then.
But you are right: with 4D inputs, the code is not equivalent. We need to fix this.
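Something along these lines would be an equivalent reference for 4D [Batch, SeqLen, NumHeads, Dimension] inputs (a rough sketch, not the final doc code):

```python
import torch

def reference_attention_4d(q, k, v):
    # Hypothetical reference for q, k, v of shape [B, S, H, D].
    scale = q.shape[-1] ** -0.5
    # Scores over sequence positions, per head: [B, H, S_q, S_k]
    attn = torch.einsum("bqhd,bkhd->bhqk", q * scale, k)
    attn = attn.softmax(dim=-1)
    # Weighted sum over keys, back to [B, S, H, D]
    return torch.einsum("bhqk,bkhd->bqhd", attn, v)
```

The important difference is that the softmax runs over the key sequence dimension within each head, not over the heads.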