Prompt-CAM: A Simpler Interpretable Transformer for Fine-Grained Analysis

Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao

2025-01-16

Explainable Artificial Intelligence (XAI), Interpretable Machine Learning, Explainable Models, Fine-Grained Image Classification, Visual Prompt Tuning


Abstract

We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, saliency maps such as Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap rather than the traits themselves. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch: it simply modifies the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, in sharp contrast to other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.
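The mechanism described in the abstract can be illustrated with a minimal, single-head sketch in plain Python. It is not the authors' implementation. Several things here are assumptions made only for illustration: the random vectors stand in for ViT patch features and learned prompts, the query/key/value projections are omitted (treated as identity), and the shared scoring vector `score_vec` is a hypothetical stand-in for the modified prediction head. The sketch shows one class prompt per category querying the image patches, each prompt's output being scored to produce a class logit, and the predicted class's attention weights being read out as a per-patch trait map.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prompt_attention(prompts, patches):
    """Each class prompt attends over the patches (single head, identity projections).

    Returns (attn, outputs), where attn[c][p] is the weight that class prompt c
    places on patch p, and outputs[c] is the attention-weighted patch average.
    """
    dim = len(patches[0])
    attn, outputs = [], []
    for q in prompts:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in patches])
        attn.append(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, patches)) for d in range(dim)]
        )
    return attn, outputs


def classify(outputs, score_vec):
    """Score every class prompt's output with one shared vector (assumed head)."""
    scores = [dot(o, score_vec) for o in outputs]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return pred, scores


# Hypothetical toy setup: 3 classes, a 4x4 grid of patches, 8-dim features.
random.seed(0)
num_classes, num_patches, dim = 3, 16, 8
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
prompts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
score_vec = [random.gauss(0, 1) for _ in range(dim)]

attn, outputs = prompt_attention(prompts, patches)
pred, scores = classify(outputs, score_vec)

# The predicted class's attention over patches plays the role of the trait map.
trait_map = attn[pred]
```

In a real model the prompts and head would be trained while the ViT backbone stays frozen, and there would be one attention map per head rather than the single map shown here.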
