Learning Deep Features for Discriminative Localization

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba

2015-12-14 | CVPR 2016 | Object Localization, Weakly-Supervised Object Localization


Abstract

In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
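
To make the link between global average pooling and localization concrete, here is a minimal sketch of the class activation mapping (CAM) idea, written in plain Python on toy data. It assumes the usual setup of a final convolutional layer followed by global average pooling and a single linear classifier. The function names, the toy feature maps, and the weights are hypothetical illustrations and do not come from the authors' code. The map for a class is the sum of the last-layer feature maps weighted by that class's classifier weights, so its spatial average equals the class score (bias terms are ignored here).

```python
# Sketch of class activation mapping (CAM) on toy data.
# All names and values are illustrative assumptions, not the authors' implementation.

def global_average_pool(feature_maps):
    """Average each H x W feature map over its spatial positions."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

def class_scores(feature_maps, weights):
    """Linear classifier on the pooled features: one score per class."""
    pooled = global_average_pool(feature_maps)
    return [sum(w * p for w, p in zip(class_w, pooled)) for class_w in weights]

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the feature maps using one class's classifier weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]

# Hypothetical toy input: 2 feature maps of size 2x2, and 2 classes.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],  # unit 0 responds at the top-left position
    [[0.0, 0.0], [0.0, 2.0]],  # unit 1 responds at the bottom-right position
]
weights = [
    [1.0, 0.0],  # class 0 relies only on unit 0
    [0.0, 1.0],  # class 1 relies only on unit 1
]

scores = class_scores(feature_maps, weights)
cam_class1 = class_activation_map(feature_maps, weights[1])
```

In this toy case the map for class 1 is nonzero only at the bottom-right position, which is where the unit supporting that class responds. In practice the map would be upsampled to the input image size before a region or bounding box is read off it.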

Results

Task	Dataset	Metric	Value	Model
Object Localization	ILSVRC 2015	Top-1 Error Rate	67.19	AlexNet-GAP
Object Localization	ILSVRC 2016	Top-5 Error	45.14	VGGnet-GAP
Object Localization	ILSVRC 2016	Top-5 Error	52.16	AlexNet-GAP
Object Localization	Tiny ImageNet	Top-1 Localization Accuracy	40.55	CAM

Related Papers

Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding (2025-06-28)
RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base (2025-06-23)
CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion (2025-06-17)
UAV Object Detection and Positioning in a Mining Industrial Metaverse with Custom Geo-Referenced Data (2025-06-16)
WoMAP: World Models For Embodied Open-Vocabulary Object Localization (2025-06-02)
Multispectral Detection Transformer with Infrared-Centric Sensor Fusion (2025-05-21)
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels (2025-05-20)