# Introduction

- The paper presents gradient computation based techniques to visualise image classification models.
- Link to the paper

# Experimental Setup

- Single deep convNet trained on ILSVRC-2013 dataset (1.2M training images and 1000 classes).
- Weight layer configuration is: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000.

# Class Model Visualisation

- Given a learnt ConvNet and a class (of interest), start with the zero image and perform optimisation by back propagating with respect to the input image (keeping the ConvNet weights constant).
- Add the mean image (for training set) to the resulting image.
- The paper used unnormalised class scores so that optimisation focuses on increasing the score of target class and not decreasing the score of other classes.

# Image-Specific Class Saliency Visualisation

- Given an image, class of interest, and trained ConvNet, rank the pixels of the input image based on their influence on class scores.
- Derivative of the class score with respect to image gives an estimate of the importance of different pixels for the class.
- The magnitude of derivative also indicated how much each pixel needs to be changed to improve the class score.

# Class Saliency Extraction

- Find the derivative of the class score with respect with respect to the input image.
- This would result in one single saliency map per colour channel.
- To obtain a single saliency map, take the maximum magnitude of derivative across all colour channels.

# Weakly Supervised Object Localisation

- The saliency map for an image provides a rough encoding of the location of the object of the class of interest.
- Given an image and its saliency map, an object segmentation map can be computed using GraphCut colour segmentation.
- Color continuity cues are needed as saliency maps might capture only the most dominant part of the object in the image.
- This weakly supervised approach achieves 46.4% top-5 error on the test set of ILSVRC-2013.

# Relation to Deconvolutional Networks

- DeconvNet-based reconstruction of the n-th layer input is similar to computing the gradient of the visualised neuron activity f with respect to the input layer.
- One difference is in the way RELU neurons are treated: In DeconvNet, the sign indicator (for the derivative of RELU) is computed on output reconstruction while in this paper, the sign indicator is computed on the layer input.