Ron Mokady, Amir Hertz, Amit H. Bermano
Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features that were trained with textual supervision, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Captioning | nocaps near-domain | CIDEr | 67.69 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps near-domain | SPICE | 11.26 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps near-domain | CIDEr | 66.82 | ClipCap (Transformer) |
| Image Captioning | nocaps near-domain | SPICE | 10.92 | ClipCap (Transformer) |
| Image Captioning | Conceptual Captions | CIDEr | 87.26 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions | ROUGE-L | 26.71 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions | SPICE | 18.5 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | Conceptual Captions | CIDEr | 71.82 | ClipCap (Transformer) |
| Image Captioning | Conceptual Captions | ROUGE-L | 25.12 | ClipCap (Transformer) |
| Image Captioning | Conceptual Captions | SPICE | 16.07 | ClipCap (Transformer) |
| Image Captioning | nocaps entire | CIDEr | 65.83 | ClipCap (Transformer) |
| Image Captioning | nocaps entire | SPICE | 10.86 | ClipCap (Transformer) |
| Image Captioning | nocaps entire | CIDEr | 65.7 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps entire | SPICE | 11.1 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | BLEU-4 | 33.53 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | CIDEr | 113.08 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | METEOR | 27.45 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | SPICE | 21.05 | ClipCap (Transformer) |
| Image Captioning | COCO Captions | BLEU-4 | 32.15 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | CIDEr | 108.35 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | METEOR | 27.1 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | COCO Captions | SPICE | 20.12 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | CIDEr | 49.35 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | SPICE | 9.7 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps out-of-domain | CIDEr | 49.14 | ClipCap (Transformer) |
| Image Captioning | nocaps out-of-domain | SPICE | 9.57 | ClipCap (Transformer) |
| Image Captioning | nocaps in-domain | CIDEr | 84.85 | ClipCap (Transformer) |
| Image Captioning | nocaps in-domain | SPICE | 12.14 | ClipCap (Transformer) |
| Image Captioning | nocaps in-domain | CIDEr | 79.73 | ClipCap (MLP + GPT2 tuning) |
| Image Captioning | nocaps in-domain | SPICE | 12.2 | ClipCap (MLP + GPT2 tuning) |
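The core idea in the abstract, mapping a single CLIP image embedding to a sequence of prefix embeddings that a frozen language model can condition on, can be sketched as follows. This is a minimal NumPy illustration with randomly initialized weights, not the paper's implementation; the dimensions (CLIP ViT-B/32 embedding size 512, GPT-2 hidden size 768) and the prefix length of 10 are assumptions for the sketch, and in practice the mapping network's weights would be the trained parameters while CLIP and GPT-2 stay frozen.

```python
import numpy as np

# Assumed dimensions: CLIP ViT-B/32 image embedding (512) and
# GPT-2 hidden size (768); prefix length of 10 "visual tokens".
CLIP_DIM, GPT2_DIM, PREFIX_LEN = 512, 768, 10

rng = np.random.default_rng(0)

# A two-layer MLP mapping network (weights are random stand-ins here;
# in the trained model these would be learned from captioned images).
HIDDEN = (GPT2_DIM * PREFIX_LEN) // 2
W1 = rng.standard_normal((CLIP_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, GPT2_DIM * PREFIX_LEN)) * 0.02

def map_clip_to_prefix(clip_embedding: np.ndarray) -> np.ndarray:
    """Project one CLIP image embedding to a (PREFIX_LEN, GPT2_DIM) prefix.

    The resulting rows act like token embeddings: they are prepended to the
    caption's token embeddings, and the language model then generates the
    caption conditioned on this visual prefix.
    """
    hidden = np.tanh(clip_embedding @ W1)  # non-linearity between layers
    prefix = hidden @ W2
    return prefix.reshape(PREFIX_LEN, GPT2_DIM)

clip_embedding = rng.standard_normal(CLIP_DIM)  # stand-in for CLIP's output
prefix = map_clip_to_prefix(clip_embedding)
print(prefix.shape)  # (10, 768)
```

Because only this small network is trained in the lightest configuration, the captioning model remains fast to train relative to approaches that fine-tune the full vision and language backbones.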