
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Chia-Wen Kuo, Zsolt Kira

2022-05-09 · CVPR 2022 · Image Captioning
Paper · PDF · Code (official)

Abstract

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
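The abstract describes retrieving contextual descriptions with a pre-trained multi-modal model (CLIP) and feeding them to the captioning model as an auxiliary input. The following is a minimal, hypothetical sketch of that retrieval step: score a small pool of Visual Genome-style attribute/relationship phrases against an image and keep the top-scoring ones as context. The description pool, image path, and k are placeholders, and the sketch uses the Hugging Face transformers CLIP wrapper rather than the authors' released code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical pool of textual descriptions mined from Visual Genome
# (attribute and relationship phrases); in the paper, retrieved phrases
# serve as auxiliary cross-modal context for the captioning model.
descriptions = [
    "a brown dog sitting on a wooden bench",
    "a person holding a red umbrella",
    "a plate of food next to a glass of water",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder input image
inputs = processor(text=descriptions, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity logits; the top-k descriptions would be passed
# to the captioning model as retrieved context.
scores = outputs.logits_per_image.softmax(dim=-1)
topk = torch.topk(scores, k=2, dim=-1)
for idx in topk.indices[0]:
    print(descriptions[idx])
```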

Results

Task | Dataset | Metric | Value | Model
Image Captioning | COCO Captions | BLEU-1 | 83.4 | Xmodal-Ctx
Image Captioning | COCO Captions | BLEU-4 | 41.4 | Xmodal-Ctx
Image Captioning | COCO Captions | CIDEr | 139.9 | Xmodal-Ctx
Image Captioning | COCO Captions | METEOR | 30.4 | Xmodal-Ctx
Image Captioning | COCO Captions | ROUGE-L | 60.4 | Xmodal-Ctx
Image Captioning | COCO Captions | SPICE | 24 | Xmodal-Ctx
Image Captioning | COCO Captions | BLEU-4 | 41.3 | Xmodal-Ctx + OSCAR
Image Captioning | COCO Captions | CIDEr | 142.2 | Xmodal-Ctx + OSCAR
Image Captioning | COCO Captions | SPICE | 24.9 | Xmodal-Ctx + OSCAR
Image Captioning | COCO Captions | BLEU-1 | 81.5 | Xmodal-Ctx
Image Captioning | COCO Captions | BLEU-4 | 39.7 | Xmodal-Ctx
Image Captioning | COCO Captions | CIDEr | 135.9 | Xmodal-Ctx
Image Captioning | COCO Captions | METEOR | 30 | Xmodal-Ctx
Image Captioning | COCO Captions | ROUGE-L | 59.5 | Xmodal-Ctx
Image Captioning | COCO Captions | SPICE | 23.7 | Xmodal-Ctx
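The metrics above (BLEU, METEOR, ROUGE-L, CIDEr, SPICE) are the standard COCO caption scores. As a rough illustration of how such numbers are usually produced, here is a minimal sketch using the pycocoevalcap toolkit; the annotation and results file paths are placeholders, and this is generic evaluation code, not the paper's own pipeline.

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth COCO caption annotations and model-generated captions
# (placeholder paths). The results file is a JSON list of
# {"image_id": ..., "caption": ...} entries.
coco = COCO("annotations/captions_val2014.json")
coco_res = coco.loadRes("captions_results.json")

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only images in the results file
coco_eval.evaluate()

# Prints BLEU-1..4, METEOR, ROUGE-L, CIDEr, and SPICE scores.
for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")
```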

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)
Edit Flows: Flow Matching with Edit Operations (2025-06-10)
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings (2025-06-10)