Text-Only Training for Image Captioning using Noise-Injected CLIP

David Nukrai, Ron Mokady, Amir Globerson

2022-11-01

Image Captioning, Semi Supervised Learning for Image Captioning, Language Modelling

Abstract

We consider the task of image-captioning using only the CLIP model and additional text data at training time, and no additional captioned images. Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar. Therefore, we only need to learn how to translate CLIP textual embeddings back into text, and we can learn how to do this by learning a decoder for the frozen CLIP text encoder using only text. We argue that this intuition is "almost correct" because of a gap between the embedding spaces, and propose to rectify this via noise injection during training. We demonstrate the effectiveness of our approach by showing SOTA zero-shot image captioning across four benchmarks, including style transfer. Code, data, and models are available on GitHub.
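The noise-injection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `inject_noise`, the unit normalisation step, the noise level `0.1`, the embedding size `512`, and the random stand-in for a CLIP text embedding are all assumptions made for illustration. The sketch only shows the perturbation that would be applied to a frozen text embedding before it is handed to a decoder during text-only training.

```python
import math
import random


def inject_noise(embedding, noise_std, rng):
    """Normalise an embedding to unit length, then add zero-mean Gaussian noise.

    Hypothetical helper: the normalisation and the isotropic Gaussian noise
    are assumptions of this sketch, not details taken from the abstract.
    """
    norm = math.sqrt(sum(x * x for x in embedding))
    unit = [x / norm for x in embedding]
    # Independent noise per coordinate; noise_std = 0.0 leaves the vector unchanged.
    return [x + rng.gauss(0.0, noise_std) for x in unit]


rng = random.Random(0)

# Stand-in for the output of a frozen CLIP text encoder (assumed 512-dimensional).
fake_text_embedding = [rng.gauss(0.0, 1.0) for _ in range(512)]

# Without noise: the plain unit-normalised embedding.
clean = inject_noise(fake_text_embedding, 0.0, rng)

# With noise: what the decoder would see during training in this sketch.
noisy = inject_noise(fake_text_embedding, 0.1, rng)
```

Under this reading, the decoder is trained to reconstruct a caption from `noisy`, so that it tolerates inputs that lie near, but not exactly on, the text embeddings. At inference time an image embedding from CLIP would presumably be passed to the same decoder without added noise.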

Results

Task	Dataset	Metric	Value	Model
Image Captioning	MSCOCO	BLEU-4	26.4	CapDec
Image Captioning	COCO Captions	BLEU-4	26.4	CapDec
Image Captioning	COCO Captions	CIDEr	91.8	CapDec
Image Captioning	COCO Captions	METEOR	25.1	CapDec
Image Captioning	FlickrStyle10K	BLEU-1 (Romantic)	29.4	CapDec
Image Captioning	Flickr30k	CIDEr	39.1	CapDec
Image Captioning	FlickrStyle10K	CIDEr	30	CapDec
Semi Supervised Learning for Image Captioning	Flickr30k	CIDEr	39.1	CapDec
Semi Supervised Learning for Image Captioning	FlickrStyle10K	CIDEr	30	CapDec
