Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

RefineCap: Concept-Aware Refinement for Image Captioning

Yekun Chai, Shuo Jin, Junliang Xing

2021-09-08 | TAG | Descriptive | Scene Understanding | Image Captioning | Language Modelling

Abstract

Automatically translating images to texts involves image scene understanding and language modeling. In this paper, we propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided visual semantics, and implicitly learns the mapping between visual tag words and images. The proposed Visual-Concept Refinement method can allow the generator to attend to semantic details in the image, thereby generating more semantically descriptive captions. Our model achieves superior performance on the MS-COCO dataset in comparison with previous visual-concept based models.
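The abstract describes refining the decoder's output vocabulary with visual concept information. The listing gives no architectural details, so the following is only a minimal illustrative sketch of one way such a refinement could look: a decoder word distribution is mixed with a distribution over visual tag words. The function name `refine_vocab_distribution`, the scalar `gate`, the toy vocabulary, and all scores are hypothetical and are not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def refine_vocab_distribution(decoder_logits, concept_scores, gate):
    """Mix the decoder's word distribution with a visual-concept distribution.

    decoder_logits: one logit per vocabulary word (hypothetical decoder output).
    concept_scores: dict of vocabulary index -> score for a detected tag word.
    gate: scalar in [0, 1], the weight given to the concept distribution.
    NOTE: this mixing rule is an assumption for illustration only.
    """
    p_dec = softmax(decoder_logits)
    # The concept distribution has mass only on tag words, zero elsewhere.
    tag_ids = sorted(concept_scores)
    p_tags = softmax([concept_scores[i] for i in tag_ids])
    p_con = [0.0] * len(decoder_logits)
    for i, p in zip(tag_ids, p_tags):
        p_con[i] = p
    return [(1 - gate) * d + gate * c for d, c in zip(p_dec, p_con)]

vocab = ["a", "dog", "cat", "on", "grass"]
logits = [1.0, 0.5, 0.6, 0.2, 0.1]      # toy decoder logits
concepts = {1: 2.0, 4: 1.0}             # toy tag scores for "dog" and "grass"
refined = refine_vocab_distribution(logits, concepts, gate=0.5)
best = vocab[max(range(len(vocab)), key=refined.__getitem__)]
print(best)
```

In this toy setup the unrefined decoder favors "a", while the mixed distribution shifts probability toward the tag words, which is the general effect the abstract attributes to concept-aware refinement.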

Results

| Task | Dataset | Metric | Value | Model |
| Image Captioning | COCO Captions | BLEU-1 | 80.2 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-2 | 64.5 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-3 | 49.9 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-4 | 37.8 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | CIDER | 127.2 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | METEOR | 28.3 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | ROUGE-L | 58 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | SPICE | 22.5 | RefineCap (w/ REINFORCE) |
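
For readers unfamiliar with the BLEU-n rows, here is a minimal sketch of the standard BLEU computation (clipped n-gram precision with a brevity penalty), simplified to a single candidate and a single reference. It is not the evaluation script behind the numbers above, which would score a whole corpus against multiple references. It returns values in [0, 1], whereas the reported values appear to be on a 0-100 scale. The example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def bleu(candidate, reference, max_n):
    """Cumulative BLEU-max_n with uniform weights and a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty * geo_mean

cand = "a dog runs on the grass".split()        # made-up candidate caption
ref = "a dog is running on the grass".split()   # made-up reference caption
print(round(bleu(cand, ref, 1), 3), round(bleu(cand, ref, 2), 3))
```

Because each higher-order score folds in a stricter n-gram precision, cumulative BLEU-n typically decreases as n grows, which matches the pattern from BLEU-1 to BLEU-4 in the table.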
