Deep Learning
How Your Neural Net Sees Cats & Dogs!
Ever wonder how your deep learning model knows a blurry blob of pixels is a cat? I mean, it doesn’t have eyes, or emotions, or even a Netflix account full of cat documentaries. But somehow, it nails it.
Welcome to the world of Class Activation Maps (CAMs) — the X-ray vision into your neural network’s brain. Class activation maps help us visualise how a classifier came to a certain decision, i.e. why it assigned an image to a particular class. Before diving into class activation maps it is important to understand the basics of CNNs and maybe a couple of deep learning architectures. This article about the basics of CNNs could be a good starting point.
What is a CAM?
Let’s try to build an intuition for class activation maps. Technically speaking, a class activation map highlights the parts of the image that were most influential in the model’s decision for a specific class. So if your model says, “Yep, that’s a cat,” the CAM will show the regions it paid attention to when making that decision — like the ears, the eyes, or maybe just the smug expression. Class activation maps are extremely helpful not just for adding interpretability to the model but also for understanding the important parts of an image. Imagine you built a tumour classifier that tells you whether a given image contains a tumour or not. A class activation map will tell you where the model looked in the image to reach that decision.
A Brief History of CAMs
Class Activation Maps were introduced in 2016 by Zhou et al. in their paper titled “Learning Deep Features for Discriminative Localization.” Their goal? To open the black box of CNNs and give us a visual explanation of why a model picked a certain class. The original CAM technique was a clever hack: they used Global Average Pooling (GAP) before the final classification layer to preserve spatial information, allowing the model to be interpretable without changing its performance. Since then, CAMs have evolved, spawning more flexible versions like Grad-CAM, Grad-CAM++, and Score-CAM, each aimed at extending this interpretability to broader model architectures without the architectural constraints of the original CAM.
Let’s try to understand CAMs in a bit more detail by looking at a general CNN architecture.
As we go from left to right through the CNN layers, we are essentially extracting features from the image. These features get more high-level as we add more layers to our model. So, for instance, a CNN which identifies cats might learn low-level features like edges and curves in its initial layers, while towards the final layers it might respond to whiskers, tails, and fur. These feature maps are what we feed into the final stage of the model, usually called the fully connected layer.
OK, let’s recap. A CNN takes an image (say, 224×224 pixels), passes it through multiple layers of filters, and ends up with a much smaller spatial map. For example, a ResNet18 ends with a 7×7×512 feature map:
- 512: number of filters (each detecting a specific pattern)
- 7×7: spatial grid (each cell represents a big chunk of the original image)
At this point, the network has ditched the fine details and is thinking in abstract patterns like “pointy shapes” or “striped textures.” So how do we get from these features to interpretability?
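If you want to see these shapes for yourself, here is a quick sanity check — a minimal sketch that chops the classifier head off a torchvision resnet18 (the layer slicing assumes torchvision’s standard ResNet module ordering):

import torch
import torchvision.models as models

# Everything up to (but not including) the GAP and FC layers
model = models.resnet18(pretrained=True)
backbone = torch.nn.Sequential(*list(model.children())[:-2])
backbone.eval()

# A dummy 224x224 RGB "image"
with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 512, 7, 7])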
Computing CAM
Let’s break down the computation of a CAM from these feature maps. This is how the original paper proposed computing CAMs:
- Final Feature Maps: You take the output from the last convolutional layer — think of it as a stack of 512 feature maps, each detecting a different kind of pattern in the original (cat) image (whiskers, maybe?).
- Global Average Pooling (GAP): This shrinks each 7×7 map into a single value by averaging. Now you’ve got a 512-dimensional vector.
- Fully Connected (FC) Layer: These 512 numbers are multiplied by weights specific to each class (e.g., “cat”, “dog”, etc.)
- To Create CAM: For the class “cat,” we take the weights assigned to “cat” and multiply them back across the original 7×7 feature maps.
- Result: You get a 7×7 map that shows where the cat-ness is strongest. You upsample that to the original image size and overlay it with pretty colours.
Mathematically this can be expressed as follows.

For a given image, let $f_k(x, y)$ denote the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$.

Then, for unit $k$, the result of performing global average pooling is

$$F_k = \sum_{x,y} f_k(x, y)$$

Thus, for a given class $c$, the input to the softmax is

$$S_c = \sum_k w_k^c F_k$$

where $w_k^c$ is the weight for unit $k$ corresponding to class $c$. By substituting $F_k$ into the above equation and rearranging,

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y) = \sum_{x,y} \sum_k w_k^c f_k(x, y)$$

We define $M_c$ as the class activation map for class $c$, where each spatial element is given by

$$M_c(x, y) = \sum_k w_k^c f_k(x, y)$$

Thus

$$S_c = \sum_{x,y} M_c(x, y)$$

Hence $M_c(x, y)$ directly indicates the importance of the activation at spatial grid $(x, y)$ in classifying the image to class $c$.
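If the rearrangement feels abstract, here is a toy sanity check with random tensors (a sketch, not code from the paper; PyTorch’s GAP averages rather than sums, and the identity holds either way):

import torch

num_classes, k = 1000, 512

f = torch.randn(k, 7, 7)           # f_k(x, y): the last conv feature maps
w = torch.randn(num_classes, k)    # w_k^c: FC weights, one row per class
c = 281                            # an arbitrary class index

# S_c the way the network computes it: GAP, then the FC layer
score = w[c] @ f.mean(dim=(1, 2))

# S_c via the CAM: weight the maps first, then pool
cam = (w[c][:, None, None] * f).sum(dim=0)   # M_c(x, y), shape [7, 7]

print(torch.allclose(score, cam.mean(), atol=1e-4))  # True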
There you go, you have got your class activation map! Let’s look at a quick visual example.
CAM for a Cat!
We will use a pre-trained resnet18 to visualise a CAM on a cat image.
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import cv2
from io import BytesIO

# Load pretrained ResNet18
model = models.resnet18(pretrained=True)
model.eval()
We will import the necessary libraries to run the inference on resnet18 and plot the maps.
# Hook to capture output of the last conv layer
feature_maps = []

def hook_fn(module, input, output):
    feature_maps.append(output.detach())
model.layer4.register_forward_hook(hook_fn)
# Load a sample image
url = "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg"
img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
# Transform for ResNet
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
This snippet captures the output of the last convolutional layer into a list called feature_maps. We use a public URL to download a cat image and run it through our model. Before running inference, we define a transform so that the image matches the input size and normalisation the model expects.
input_tensor = transform(img).unsqueeze(0)

# Forward pass
with torch.no_grad():
_ = model(input_tensor)
# Extract feature maps from last conv layer
fm = feature_maps[0][0] # Shape: [512, 7, 7]
# Pick top 4 channels by average activation
top_indices = torch.topk(fm.mean(dim=(1, 2)), 4).indices
# Plot image + top feature maps
fig, axes = plt.subplots(1, 5, figsize=(20, 5))
axes[0].imshow(img)
axes[0].set_title("Original Image")
axes[0].axis('off')
for i, idx in enumerate(top_indices):
fmap = fm[idx].numpy()
sns.heatmap(fmap, ax=axes[i+1], cbar=False, cmap='viridis')
axes[i+1].set_title(f"Feature Map {idx.item()}")
axes[i+1].axis('off')
plt.tight_layout()
plt.show()
Finally, we run the image through our model and plot the feature maps. Since there are 512 of them, we only plot the four with the highest average activation.
Let’s now take these feature maps and convert them into a CAM.
# Get feature maps and class weights
features = feature_maps[0] # Shape: [1, 512, 7, 7]
features = features.squeeze(0)  # [512, 7, 7]

# Get class index for "cat" (or use argmax)
output = model(input_tensor)
class_idx = output.argmax().item()
# Get weights from the FC layer
params = list(model.parameters())
fc_weights = params[-2] # Shape: [num_classes, 512]
weights_for_class = fc_weights[class_idx] # Shape: [512]
# Compute CAM
cam = torch.zeros(7, 7)
for i in range(512):
cam += weights_for_class[i] * features[i]
# ReLU
cam = torch.relu(cam)
# Normalize and upscale
cam = cam - cam.min()
cam = cam / cam.max()
cam_np = cam.detach().cpu().numpy()
cam_resized = cv2.resize(cam_np, (224, 224))
# Overlay on original image
plt.imshow(img)
plt.imshow(cam_resized, cmap='jet', alpha=0.5)
plt.title(f"CAM for class: {class_idx}")
plt.axis('off')
plt.show()
Let’s look at what we are doing here. We take the feature maps generated by our resnet18 (512×7×7) and grab the fully connected weights for the class we are interested in (here, the model’s predicted class). We then compute a weighted sum of the feature maps with those class weights to get the CAM, resize it from 7×7 to 224×224 to match the input image, and overlay it on top of the original image.
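As an aside, the per-channel loop above can be collapsed into a single tensor operation; the following is equivalent, reusing features and weights_for_class from the snippet:

# Weight each of the 512 maps by its class weight and sum over the channel axis
cam = torch.relu(torch.einsum('k,kxy->xy', weights_for_class, features))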
In the above two class activation maps we can see that the model looks at the cat’s face to figure out whether it is a cat or not. In the second image we see that the model is also looking at the cat’s limbs and body to classify it! The best part is that the model learns these features on its own. Remember that we haven’t supervised the model to look at these specific features; we only told it whether there is a cat in the picture or not. Where the cat is in the picture is something the model figures out by itself. This is called Weakly Supervised Object Localisation (WSOL). In WSOL, the goal is to localise objects in an image (i.e., draw bounding boxes or attention maps) without having bounding box annotations during training.
Instead, the model is trained only with image-level labels like “this image contains a cat”, and it has to learn to infer where in the image the object is.
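If you want an actual box rather than a heatmap, the recipe in the original paper is to threshold the CAM at 20% of its max value and take the bounding box of the largest connected region. A minimal sketch of that, reusing cam_resized and img from the cat example:

import cv2
import numpy as np

# Threshold the CAM at 20% of its max (the heuristic from the original paper)
heat = (cam_resized * 255).astype(np.uint8)
_, mask = cv2.threshold(heat, int(0.2 * heat.max()), 255, cv2.THRESH_BINARY)

# Take the bounding box of the largest connected region
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))

# Draw it on the image (resized to match the 224x224 CAM)
img_np = cv2.cvtColor(np.array(img.resize((224, 224))), cv2.COLOR_RGB2BGR)
cv2.rectangle(img_np, (x, y), (x + w, y + h), (0, 0, 255), 2)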
Now, finding CAMs for cat images is fun, and kind of useful if you want to understand where the model looked before classifying the image. But can this have more interesting applications?
Localising Pathology in Medical Images
CAMs can play an important role in localising pathologies in scans. Imagine a model that can detect, say, pneumonia in a chest X-ray and also tell you where to look for signs of pneumonia in that X-ray! Let’s look at one such example, similar to this article where we saw image classification of chest X-rays. We will use a chest X-ray dataset from Kaggle, from the pneumonia detection challenge; the files are DICOM images, the format generally used in medical imaging.
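One practical wrinkle: DICOM files can’t be opened with PIL directly. Here is a minimal sketch of the preprocessing, assuming pydicom is installed and the dataset is extracted locally (the file path below is hypothetical):

import pydicom
import numpy as np
from PIL import Image

# Read one DICOM file and pull out the raw pixel array
dcm = pydicom.dcmread("stage_2_train_images/sample.dcm")
arr = dcm.pixel_array.astype(np.float32)

# Scale to 0-255 and convert to RGB so the same ResNet transform applies
arr = 255 * (arr - arr.min()) / (arr.max() - arr.min() + 1e-8)
img = Image.fromarray(arr.astype(np.uint8)).convert('RGB')

# From here, the transform / forward pass / CAM code from the cat example carries over
input_tensor = transform(img).unsqueeze(0)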
If we take a general look at the data, it looks like this:
I have used a resnet18 model to classify this dataset. After grabbing the feature maps and the weights of the fully connected layer in exactly the same way as we did for the cat images above, we can generate CAMs for a few different images.
We can see that the CAM focuses on the area which has pneumonia!
The CAM presented in the original paper relies on Global Average Pooling (GAP) right before the FC layer, which is limiting: it only applies to architectures that end in GAP followed by a single classification layer. Grad-CAM fixes this by using the gradient of the class score with respect to each feature map, which tells you how sensitive the class prediction is to changes in that map. You still compute a weighted sum of feature maps, but now the weights come from gradients, not from the FC layer.
Result? It works with any architecture, no GAP required.
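To make that concrete, here is a minimal Grad-CAM sketch (a bare-bones version of the idea, not the full recipe from the Grad-CAM paper), reusing model and input_tensor from earlier:

activations, gradients = [], []

def fwd_hook(module, inp, out):
    activations.append(out)

def bwd_hook(module, grad_in, grad_out):
    gradients.append(grad_out[0])

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

# Forward + backward on the top class to populate both hooks
output = model(input_tensor)
output[0, output.argmax()].backward()

# Weights come from gradients averaged over space, not from the FC layer
weights = gradients[0].mean(dim=(2, 3))[0]  # [512]
grad_cam = torch.relu((weights[:, None, None] * activations[0][0]).sum(dim=0)).detach()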
Limitations of CAMs
While CAMs offer powerful insights into model behaviour, they come with important limitations, especially in high-stakes domains like medical imaging. Most critically, CAMs only highlight coarse regions and lack fine-grained localisation: in tasks where precision matters, say identifying a tiny tumour, they can be too vague to be trustworthy. In short, they tell you where the model is focusing, but not exactly what it is seeing. So while CAMs and Grad-CAMs are fantastic tools for interpretability and debugging, you wouldn’t want to use them as your sole basis for critical decisions.