Ron Mokady, Amir Hertz, Amit H. Bermano
Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features that were trained with textual supervision, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Captioning | nocaps near-domain | CIDEr | 67.69 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps near-domain | SPICE | 11.26 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps near-domain | CIDEr | 66.82 | ClipCap (Transformer) |
| Image Captioning | nocaps near-domain | SPICE | 10.92 | ClipCap (Transformer) |
| Image Captioning | Conceptual Captions | CIDEr | 87.26 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions | ROUGE-L | 26.71 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions | SPICE | 18.5 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions | CIDEr | 71.82 | ClipCap (Transformer) |
| Image Captioning | Conceptual Captions | ROUGE-L | 25.12 | ClipCap (Transformer) |
| Image Captioning | Conceptual Captions | SPICE | 16.07 | ClipCap (Transformer) |
| Image Captioning | nocaps entire | CIDEr | 65.83 | ClipCap (Transformer) |
| Image Captioning | nocaps entire | SPICE | 10.86 | ClipCap (Transformer) |
| Image Captioning | nocaps entire | CIDEr | 65.7 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps entire | SPICE | 11.1 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | BLEU-4 | 33.53 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | CIDEr | 113.08 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | METEOR | 27.45 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | SPICE | 21.05 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | BLEU-4 | 32.15 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | CIDEr | 108.35 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | METEOR | 27.1 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | SPICE | 20.12 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | CIDEr | 49.35 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | SPICE | 9.7 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | CIDEr | 49.14 | ClipCap (Transformer) |
| Image Captioning | nocaps out-of-domain | SPICE | 9.57 | ClipCap (Transformer) |
| Image Captioning | nocaps in-domain | CIDEr | 84.85 | ClipCap (Transformer) |
| Image Captioning | nocaps in-domain | SPICE | 12.14 | ClipCap (Transformer) |
| Image Captioning | nocaps in-domain | CIDEr | 79.73 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps in-domain | SPICE | 12.2 | ClipCap (MLP + GPT2 tuning) |
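The core idea in the abstract, mapping a single CLIP image embedding to a sequence of prefix embeddings that a frozen language model can condition on, can be sketched as follows. This is a minimal NumPy illustration with randomly initialized weights, not the paper's implementation; the dimensions (CLIP ViT-B/32 embedding size 512, GPT-2 hidden size 768) and the prefix length of 10 are assumptions for the sketch, and in practice the mapping network's weights would be the trained parameters while CLIP and GPT-2 stay frozen.

```python
import numpy as np

# Assumed dimensions: CLIP ViT-B/32 image embedding (512) and
# GPT-2 hidden size (768); prefix length of 10 "visual tokens".
CLIP_DIM, GPT2_DIM, PREFIX_LEN = 512, 768, 10

rng = np.random.default_rng(0)

# A two-layer MLP mapping network (weights are random stand-ins here;
# in the trained model these would be learned from captioned images).
HIDDEN = (GPT2_DIM * PREFIX_LEN) // 2
W1 = rng.standard_normal((CLIP_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, GPT2_DIM * PREFIX_LEN)) * 0.02

def map_clip_to_prefix(clip_embedding: np.ndarray) -> np.ndarray:
    """Project one CLIP image embedding to a (PREFIX_LEN, GPT2_DIM) prefix.

    The resulting rows act like token embeddings: they are prepended to the
    caption's token embeddings, and the language model then generates the
    caption conditioned on this visual prefix.
    """
    hidden = np.tanh(clip_embedding @ W1)  # non-linearity between layers
    prefix = hidden @ W2
    return prefix.reshape(PREFIX_LEN, GPT2_DIM)

clip_embedding = rng.standard_normal(CLIP_DIM)  # stand-in for CLIP's output
prefix = map_clip_to_prefix(clip_embedding)
print(prefix.shape)  # (10, 768)
```

Because only this small network is trained in the lightest configuration, the captioning model remains fast to train relative to approaches that fine-tune the full vision and language backbones.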