Yekun Chai, Shuo Jin, Junliang Xing
Automatically translating images into text involves image scene understanding and language modeling. In this paper, we propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided visual semantics, and implicitly learns the mapping between visual tag words and images. The proposed Visual-Concept Refinement method allows the generator to attend to semantic details in the image, thereby producing more semantically descriptive captions. Our model achieves superior performance on the MS-COCO dataset in comparison with previous visual-concept based models.
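To make the refinement idea concrete, below is a minimal PyTorch sketch of decoder-guided refinement of vocabulary logits with visual concept (tag) embeddings. It is an illustration of the general technique only, not the paper's exact architecture; the module name `VisualConceptRefiner` and all parameter shapes are assumptions for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualConceptRefiner(nn.Module):
    """Sketch: bias the decoder's vocabulary logits using attention over
    visual concept (tag) embeddings, with the decoder state as the query.
    Names and dimensions are hypothetical, not taken from RefineCap."""

    def __init__(self, hidden_dim: int, concept_dim: int, vocab_size: int):
        super().__init__()
        # Project the decoder hidden state into the concept embedding space.
        self.query = nn.Linear(hidden_dim, concept_dim)
        # Map the attended concept context to a vocabulary-sized refinement term.
        self.to_vocab = nn.Linear(concept_dim, vocab_size)

    def forward(self, decoder_state, concept_embs, base_logits):
        # decoder_state: (batch, hidden_dim)        current decoder hidden state
        # concept_embs:  (batch, n_concepts, dim)   embeddings of detected tag words
        # base_logits:   (batch, vocab_size)        unrefined vocabulary logits
        q = self.query(decoder_state).unsqueeze(1)          # (batch, 1, dim)
        scores = (q * concept_embs).sum(-1)                 # (batch, n_concepts)
        attn = F.softmax(scores / concept_embs.size(-1) ** 0.5, dim=-1)
        ctx = (attn.unsqueeze(-1) * concept_embs).sum(1)    # (batch, dim)
        # Additive refinement nudges probability mass toward visual tag words.
        return base_logits + self.to_vocab(ctx)
```

At each decoding step, the refiner would be applied to the decoder's raw logits before the softmax, so tokens matching detected visual concepts receive a learned boost; an additive bias is one simple choice, and gated or multiplicative variants would fit the same interface.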
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Captioning | COCO Captions | BLEU-1 | 80.2 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-2 | 64.5 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-3 | 49.9 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-4 | 37.8 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | CIDEr | 127.2 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | METEOR | 28.3 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | ROUGE-L | 58.0 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | SPICE | 22.5 | RefineCap (w/ REINFORCE) |