-
Hi @kiashann These are toy examples that visualize the whole attention map and the attention map for the class token only (see here for more information).

```python
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
from PIL import Image
from timm.models import create_model
from torchvision.transforms import Compose, Resize, CenterCrop, Normalize, ToTensor

def to_tensor(img):
    transform_fn = Compose([
        Resize(249, 3),  # 3 = bicubic interpolation
        CenterCrop(224),
        ToTensor(),
        Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    return transform_fn(img)

def show_img(img):
    img = np.asarray(img)
    plt.figure(figsize=(10, 10))
    plt.imshow(img)
    plt.axis('off')
    plt.show()

def show_img2(img1, img2, alpha=0.8):
    img1 = np.asarray(img1)
    img2 = np.asarray(img2)
    plt.figure(figsize=(10, 10))
    plt.imshow(img1)
    plt.imshow(img2, alpha=alpha)
    plt.axis('off')
    plt.show()

def my_forward_wrapper(attn_obj):
    def my_forward(x):
        B, N, C = x.shape
        qkv = attn_obj.qkv(x).reshape(B, N, 3, attn_obj.num_heads, C // attn_obj.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)  # make torchscript happy (cannot use tensor as tuple)

        attn = (q @ k.transpose(-2, -1)) * attn_obj.scale
        attn = attn.softmax(dim=-1)
        attn = attn_obj.attn_drop(attn)

        # stash the maps on the module so they can be read out after forward()
        attn_obj.attn_map = attn                   # (B, heads, N, N)
        attn_obj.cls_attn_map = attn[:, :, 0, 2:]  # cls-token attention to the patch tokens (0 = cls, 1 = distill)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = attn_obj.proj(x)
        x = attn_obj.proj_drop(x)
        return x
    return my_forward

img = Image.open('n02102480_Sussex_spaniel.JPEG')
x = to_tensor(img)

model = create_model('deit_small_distilled_patch16_224', pretrained=True)
model.blocks[-1].attn.forward = my_forward_wrapper(model.blocks[-1].attn)
y = model(x.unsqueeze(0))

attn_map = model.blocks[-1].attn.attn_map.mean(dim=1).squeeze(0).detach()
cls_weight = model.blocks[-1].attn.cls_attn_map.mean(dim=1).view(14, 14).detach()

img_resized = x.permute(1, 2, 0) * 0.5 + 0.5  # undo the normalization for display
cls_resized = F.interpolate(cls_weight.view(1, 1, 14, 14), (224, 224), mode='bilinear').view(224, 224, 1)

show_img(img)
show_img(attn_map)
show_img(cls_weight)
show_img(img_resized)
show_img2(img_resized, cls_resized, alpha=0.8)
```

The attention map for the last layer is 198 × 198 (= 196 image patch tokens + 1 cls + 1 distill).
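Note that `attn_map` above is the head-averaged map (deit_small has 6 heads); if you want to look at a single head instead of the mean, you can index the head dimension directly, e.g.:

```python
# inspect each head separately instead of the head-averaged map
attn_per_head = model.blocks[-1].attn.attn_map.squeeze(0).detach()  # (heads, 198, 198)
for h in range(attn_per_head.shape[0]):
    show_img(attn_per_head[h])
```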
-
@kiashann
-
I would like to apply this code to the 'vit_small_patch16_384' model from timm. How should I modify the code for this purpose?
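A sketch of the changes that should suffice (untested, assuming the wrapper above): `vit_small_patch16_384` takes 384 × 384 inputs and has no distillation token, so the class-token slice starts at index 1 and the patch grid is 24 × 24 (384 / 16):

```python
# untested sketch for 'vit_small_patch16_384'
transform_fn = Compose([Resize(384, 3), CenterCrop(384), ToTensor(),
                        Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

model = create_model('vit_small_patch16_384', pretrained=True)
model.blocks[-1].attn.forward = my_forward_wrapper(model.blocks[-1].attn)

# inside my_forward there is no distill token, so skip only the cls token:
#   attn_obj.cls_attn_map = attn[:, :, 0, 1:]

# 384 / 16 = 24 patches per side -> 24 x 24 grid, upsampled back to 384
cls_weight = model.blocks[-1].attn.cls_attn_map.mean(dim=1).view(24, 24).detach()
cls_resized = F.interpolate(cls_weight.view(1, 1, 24, 24), (384, 384), mode='bilinear').view(384, 384, 1)
```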
-
So, this doesn't include the visualization helpers yet, but I've added a simpler extraction helper to get the attention activations via one of two methods, fx or hooks. It's WIP but can be seen at https://github.com/huggingface/pytorch-image-models/pull/2168/files#diff-358e0d5feb2c109ff53d21bc4fa8a6af94566be622b0f1167316216b0036b8b3
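For reference, a minimal hook-based sketch of the same idea (my illustration, not the helper from the PR; it assumes a recent timm where `Attention` exposes a `fused_attn` flag and an `attn_drop` submodule):

```python
import torch
from timm import create_model

model = create_model('deit_small_distilled_patch16_224', pretrained=True).eval()
for block in model.blocks:
    block.attn.fused_attn = False  # force the math path so attn_drop sees the attention matrix

attn_maps = {}

def save_attn(name):
    def hook(module, args, output):
        attn_maps[name] = output.detach()  # (B, heads, N, N), post-softmax
    return hook

for i, block in enumerate(model.blocks):
    block.attn.attn_drop.register_forward_hook(save_attn(f'blocks.{i}.attn'))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

print({name: a.shape for name, a in attn_maps.items()})
```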
-
FYI, there's a fix on main for the node/module matching so that outputs remain in order of traversal (which usually matches the order of the forward pass, at least for timm models) regardless of how many matching names/wildcards are specified.
-
Hi! @hankyul2 Thanks for your excellent explanation above. I understood most of it but was still confused about why […]. Thanks!
-
This might be helpful: https://github.com/facebookresearch/dino/blob/main/visualize_attention.py
-
Hi, I want to extract the attention map from a pretrained vision transformer for a specific image.
How can I do that?