Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scaling Up Vision-Language Pre-training for Image Captioning

Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

2021-11-24 · CVPR 2022 · Tasks: Attribute, Image Captioning

Paper · PDF

Abstract

In recent years, we have witnessed a significant performance boost on the image captioning task from vision-language pre-training (VLP). Scale is believed to be an important factor in this advance. However, most existing work only pre-trains transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study of the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference, which consists of an image feature extractor and a transformer, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. On the data side, we conduct experiments with up to 200 million image-text pairs automatically collected from the web based on the alt attribute of the image (dubbed ALT200M). Extensive analysis characterizes the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state-of-the-art results on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
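The abstract describes harvesting image-text pairs from the web via the HTML alt attribute. As a rough illustration of that idea (not the paper's actual pipeline, which also filters for caption quality, language, and image availability), a minimal alt-text collector using only the Python standard library might look like this; the class name and the three-word filter threshold are illustrative assumptions:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from an HTML document.

    Hypothetical sketch of alt-attribute harvesting in the spirit of
    ALT200M; the real pipeline involves far more filtering.
    """
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src, alt = a.get("src"), (a.get("alt") or "").strip()
        # Keep only images whose alt text looks like a usable caption
        # (illustrative heuristic: at least three words).
        if src and len(alt.split()) >= 3:
            self.pairs.append((src, alt))

html = '<img src="a.jpg" alt="a dog catching a frisbee"><img src="b.png" alt="">'
collector = AltTextCollector()
collector.feed(html)
print(collector.pairs)  # → [('a.jpg', 'a dog catching a frisbee')]
```

At web scale, the interesting engineering is in the filtering stage rather than the extraction itself: short, empty, or boilerplate alt strings (as with `b.png` above) are discarded.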

Results

Task | Dataset | Metric | Value | Model
Image Captioning | nocaps entire | BLEU-1 | 85.62 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | BLEU-2 | 71.36 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | BLEU-3 | 53.62 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | BLEU-4 | 34.65 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | CIDEr | 114.25 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | METEOR | 31.27 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | ROUGE-L | 61.2 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | SPICE | 14.85 | Microsoft Cognitive Services team
Image Captioning | nocaps-val-out-domain | CIDEr | 111.3 | LEMON_large
Image Captioning | nocaps-val-out-domain | SPICE | 14 | LEMON_large
Image Captioning | nocaps-val-near-domain | CIDEr | 113.3 | LEMON_large
Image Captioning | nocaps-val-near-domain | SPICE | 15.1 | LEMON_large
Image Captioning | COCO Captions | BLEU-4 | 42.6 | LEMON
Image Captioning | COCO Captions | CIDEr | 145.5 | LEMON
Image Captioning | COCO Captions | METEOR | 31.4 | LEMON
Image Captioning | COCO Captions | SPICE | 25.5 | LEMON
Image Captioning | nocaps-val-overall | CIDEr | 113.4 | LEMON_large
Image Captioning | nocaps-val-overall | SPICE | 15 | LEMON_large
Image Captioning | nocaps-val-in-domain | CIDEr | 116.9 | LEMON_large
Image Captioning | nocaps-val-in-domain | SPICE | 15.8 | LEMON_large
Image Captioning | nocaps-val-in-domain | CIDEr | 107.7 | LEMON_base
Image Captioning | nocaps-val-in-domain | SPICE | 14.7 | LEMON_base
Image Captioning | nocaps-XD entire | BLEU-1 | 85.62 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | BLEU-2 | 71.36 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | BLEU-3 | 53.62 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | BLEU-4 | 34.65 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | CIDEr | 114.25 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | METEOR | 31.27 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | ROUGE-L | 61.2 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | SPICE | 14.85 | Microsoft Cognitive Services team
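The BLEU-1 through BLEU-4 rows above are n-gram overlap scores between a generated caption and human references. The core of BLEU is clipped (modified) n-gram precision; a minimal sketch of just that step, omitting the brevity penalty and corpus-level aggregation that full BLEU adds, is:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision, the core of BLEU-n.

    Each candidate n-gram is credited at most as many times as it
    appears in any single reference (the 'clipping' step), so a
    caption cannot game the score by repeating a common word.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # Per n-gram, the maximum count over all references.
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

cand = "a dog catching a frisbee"
refs = ["a dog catches a frisbee", "a dog leaps for a frisbee"]
print(modified_precision(cand, refs, 1))  # → 0.8 (4 of 5 unigrams matched)
```

Reported BLEU-n combines the geometric mean of these precisions for 1..n with a brevity penalty, computed over the whole test corpus rather than one sentence.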

Related Papers

- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Non-Adaptive Adversarial Face Generation (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Attributes Shape the Embedding Space of Face Recognition Models (2025-07-15)
- COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation (2025-07-15)
- Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
- Model Parallelism With Subnetwork Data Parallelism (2025-07-11)
- Bradley-Terry and Multi-Objective Reward Modeling Are Complementary (2025-07-10)