Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention

Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara

Published: 2017-06-26 · Tasks: Saliency Prediction, Image Captioning

Abstract

Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Even though saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, research is still struggling to incorporate these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We show, through extensive quantitative and qualitative experiments on large scale datasets, that our model achieves superior performances with respect to captioning baselines with and without saliency, and to different state of the art approaches combining saliency and captioning.
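The core idea in the abstract is an attention mechanism whose weights over image regions are conditioned on a saliency map, with one attentive path for salient regions and one for contextual (non-salient) regions. The sketch below is not the authors' exact formulation; it is a minimal NumPy illustration of that split, assuming region features, decoder attention scores, and per-region saliency values in [0, 1] are already available (all names here are hypothetical):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def saliency_context_attention(regions, attn_scores, saliency):
    """Split attention over image regions into a salient path and a
    contextual path, biased by a predicted saliency map.

    regions:     (N, D) array of region feature vectors
    attn_scores: (N,)   unnormalized attention scores from the decoder
    saliency:    (N,)   predicted saliency per region, values in [0, 1]
    Returns a (2*D,) vector: [salient context; non-salient context].
    """
    eps = 1e-8
    # Bias the same decoder scores toward salient vs. contextual regions.
    w_sal = softmax(attn_scores + np.log(saliency + eps))
    w_ctx = softmax(attn_scores + np.log(1.0 - saliency + eps))
    return np.concatenate([w_sal @ regions, w_ctx @ regions])

# Toy usage: 4 regions with 3-d features; the first two are highly salient.
rng = np.random.default_rng(0)
regions = rng.standard_normal((4, 3))
scores = np.zeros(4)
saliency = np.array([0.9, 0.8, 0.1, 0.2])
vec = saliency_context_attention(regions, scores, saliency)
print(vec.shape)  # (6,)
```

In the paper's actual model the two attended vectors condition a recurrent decoder at each generation step; here they are simply concatenated to show how a single set of attention scores can be steered by the saliency prior.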

Results

Task             | Dataset                 | Metric | Value | Model
Image Captioning | Flickr30k Captions test | BLEU-4 | 21.3  | Cornia et al.
Image Captioning | Flickr30k Captions test | CIDEr  | 46.4  | Cornia et al.
Image Captioning | Flickr30k Captions test | METEOR | 20    | Cornia et al.

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)
Edit Flows: Flow Matching with Edit Operations (2025-06-10)
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings (2025-06-10)