Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ClipCap: CLIP Prefix for Image Captioning

Ron Mokady, Amir Hertz, Amit H. Bermano

2021-11-18 | Tasks: Image Captioning, Language Modelling

Abstract

Image captioning is a fundamental task in vision-language understanding, in which a model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to this task. We use a CLIP encoding as a prefix to the caption, employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features that were trained with textual context, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only relatively quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
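The core mechanism described above, a mapping network that turns one CLIP image embedding into a sequence of prefix embeddings that are prepended to the caption's token embeddings before they enter GPT-2, can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation: the dimensions (512-d CLIP embedding, 768-d GPT-2 embeddings, prefix length 10) match common CLIP ViT-B/32 and GPT-2 configurations, but the weight shapes, initialization, and activation are placeholder assumptions.

```python
import numpy as np

# Illustrative dimensions (assumptions, not taken from the paper's code):
# CLIP ViT-B/32 image embedding is 512-d; GPT-2 token embeddings are 768-d.
clip_dim, gpt_dim, prefix_len = 512, 768, 10

rng = np.random.default_rng(0)

# Mapping network: a small MLP that expands one CLIP embedding into
# `prefix_len` pseudo-token embeddings. W1/W2 are hypothetical placeholder
# weights; in training these are the (main) learned parameters.
hidden = (clip_dim + gpt_dim * prefix_len) // 2
W1 = rng.standard_normal((clip_dim, hidden)) * 0.02
W2 = rng.standard_normal((hidden, gpt_dim * prefix_len)) * 0.02

def clip_to_prefix(clip_embedding):
    """Map a single CLIP embedding to a (prefix_len, gpt_dim) prefix."""
    h = np.tanh(clip_embedding @ W1)          # non-linearity between layers
    return (h @ W2).reshape(prefix_len, gpt_dim)

# Fake CLIP output for one image (in practice: frozen CLIP image encoder).
clip_embedding = rng.standard_normal(clip_dim)
prefix = clip_to_prefix(clip_embedding)

# The prefix is concatenated in front of the embedded caption tokens, and
# GPT-2 is trained to predict the caption conditioned on this prefix.
caption_tokens = rng.standard_normal((5, gpt_dim))   # 5 embedded tokens
gpt2_input = np.concatenate([prefix, caption_tokens], axis=0)
print(gpt2_input.shape)
```

In the paper's lightest variant, only this mapping network is trained (a transformer mapper in that case) while CLIP and GPT-2 both stay frozen; in the stronger variant the MLP mapper is trained jointly with GPT-2 fine-tuning.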

Results

| Task             | Dataset              | Metric  | Value  | Model                       |
|------------------|----------------------|---------|--------|-----------------------------|
| Image Captioning | nocaps near-domain   | CIDEr   | 67.69  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps near-domain   | SPICE   | 11.26  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps near-domain   | CIDEr   | 66.82  | ClipCap (Transformer)       |
| Image Captioning | nocaps near-domain   | SPICE   | 10.92  | ClipCap (Transformer)       |
| Image Captioning | Conceptual Captions  | CIDEr   | 87.26  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions  | ROUGE-L | 26.71  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions  | SPICE   | 18.5   | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions  | CIDEr   | 71.82  | ClipCap (Transformer)       |
| Image Captioning | Conceptual Captions  | ROUGE-L | 25.12  | ClipCap (Transformer)       |
| Image Captioning | Conceptual Captions  | SPICE   | 16.07  | ClipCap (Transformer)       |
| Image Captioning | nocaps entire        | CIDEr   | 65.83  | ClipCap (Transformer)       |
| Image Captioning | nocaps entire        | SPICE   | 10.86  | ClipCap (Transformer)       |
| Image Captioning | nocaps entire        | CIDEr   | 65.7   | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps entire        | SPICE   | 11.1   | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions        | BLEU-4  | 33.53  | ClipCap (Transformer)       |
| Image Captioning | COCO Captions        | CIDEr   | 113.08 | ClipCap (Transformer)       |
| Image Captioning | COCO Captions        | METEOR  | 27.45  | ClipCap (Transformer)       |
| Image Captioning | COCO Captions        | SPICE   | 21.05  | ClipCap (Transformer)       |
| Image Captioning | COCO Captions        | BLEU-4  | 32.15  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions        | CIDEr   | 108.35 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions        | METEOR  | 27.1   | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions        | SPICE   | 20.12  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | CIDEr   | 49.35  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | SPICE   | 9.7    | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | CIDEr   | 49.14  | ClipCap (Transformer)       |
| Image Captioning | nocaps out-of-domain | SPICE   | 9.57   | ClipCap (Transformer)       |
| Image Captioning | nocaps in-domain     | CIDEr   | 84.85  | ClipCap (Transformer)       |
| Image Captioning | nocaps in-domain     | SPICE   | 12.14  | ClipCap (Transformer)       |
| Image Captioning | nocaps in-domain     | CIDEr   | 79.73  | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps in-domain     | SPICE   | 12.2   | ClipCap (MLP + GPT2 tuning) |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)