Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scaling Up Vision-Language Pre-training for Image Captioning

Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

2021-11-24 · CVPR 2022 · Tasks: Attribute, Image Captioning

Paper · PDF

Abstract

In recent years, we have witnessed a significant performance boost on the image captioning task from vision-language pre-training (VLP). Scale is believed to be an important factor in this advance. However, most existing work only pre-trains transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study of the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference, which consists of an image feature extractor and a transformer, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. On the data side, we conduct experiments with up to 200 million image-text pairs automatically collected from the web based on the alt attribute of the image (dubbed ALT200M). Extensive analysis characterizes the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state-of-the-art results on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
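The abstract describes harvesting image-text pairs from the web via the HTML alt attribute. As a rough illustration of that idea (not the paper's actual pipeline, which also filters for caption quality, language, and image availability), a minimal alt-text collector using only the Python standard library might look like this; the class name and the three-word filter threshold are illustrative assumptions:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from an HTML document.

    Hypothetical sketch of alt-attribute harvesting in the spirit of
    ALT200M; the real pipeline involves far more filtering.
    """
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src, alt = a.get("src"), (a.get("alt") or "").strip()
        # Keep only images whose alt text looks like a usable caption
        # (illustrative heuristic: at least three words).
        if src and len(alt.split()) >= 3:
            self.pairs.append((src, alt))

html = '<img src="a.jpg" alt="a dog catching a frisbee"><img src="b.png" alt="">'
collector = AltTextCollector()
collector.feed(html)
print(collector.pairs)  # → [('a.jpg', 'a dog catching a frisbee')]
```

At web scale, the interesting engineering is in the filtering stage rather than the extraction itself: short, empty, or boilerplate alt strings (as with `b.png` above) are discarded.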

Results

Task | Dataset | Metric | Value | Model
Image Captioning | nocaps entire | BLEU-1 | 85.62 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | BLEU-2 | 71.36 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | BLEU-3 | 53.62 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | BLEU-4 | 34.65 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | CIDEr | 114.25 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | METEOR | 31.27 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | ROUGE-L | 61.2 | Microsoft Cognitive Services team
Image Captioning | nocaps entire | SPICE | 14.85 | Microsoft Cognitive Services team
Image Captioning | nocaps-val-out-domain | CIDEr | 111.3 | LEMON_large
Image Captioning | nocaps-val-out-domain | SPICE | 14 | LEMON_large
Image Captioning | nocaps-val-near-domain | CIDEr | 113.3 | LEMON_large
Image Captioning | nocaps-val-near-domain | SPICE | 15.1 | LEMON_large
Image Captioning | COCO Captions | BLEU-4 | 42.6 | LEMON
Image Captioning | COCO Captions | CIDEr | 145.5 | LEMON
Image Captioning | COCO Captions | METEOR | 31.4 | LEMON
Image Captioning | COCO Captions | SPICE | 25.5 | LEMON
Image Captioning | nocaps-val-overall | CIDEr | 113.4 | LEMON_large
Image Captioning | nocaps-val-overall | SPICE | 15 | LEMON_large
Image Captioning | nocaps-val-in-domain | CIDEr | 116.9 | LEMON_large
Image Captioning | nocaps-val-in-domain | SPICE | 15.8 | LEMON_large
Image Captioning | nocaps-val-in-domain | CIDEr | 107.7 | LEMON_base
Image Captioning | nocaps-val-in-domain | SPICE | 14.7 | LEMON_base
Image Captioning | nocaps-XD entire | BLEU-1 | 85.62 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | BLEU-2 | 71.36 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | BLEU-3 | 53.62 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | BLEU-4 | 34.65 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | CIDEr | 114.25 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | METEOR | 31.27 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | ROUGE-L | 61.2 | Microsoft Cognitive Services team
Image Captioning | nocaps-XD entire | SPICE | 14.85 | Microsoft Cognitive Services team
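The BLEU-1 through BLEU-4 rows above are n-gram overlap scores between a generated caption and human references. The core of BLEU is clipped (modified) n-gram precision; a minimal sketch of just that step, omitting the brevity penalty and corpus-level aggregation that full BLEU adds, is:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision, the core of BLEU-n.

    Each candidate n-gram is credited at most as many times as it
    appears in any single reference (the 'clipping' step), so a
    caption cannot game the score by repeating a common word.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # Per n-gram, the maximum count over all references.
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

cand = "a dog catching a frisbee"
refs = ["a dog catches a frisbee", "a dog leaps for a frisbee"]
print(modified_precision(cand, refs, 1))  # → 0.8 (4 of 5 unigrams matched)
```

Reported BLEU-n combines the geometric mean of these precisions for 1..n with a brevity penalty, computed over the whole test corpus rather than one sentence.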

Related Papers

- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Non-Adaptive Adversarial Face Generation (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Attributes Shape the Embedding Space of Face Recognition Models (2025-07-15)
- COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation (2025-07-15)
- Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
- Model Parallelism With Subnetwork Data Parallelism (2025-07-11)
- Bradley-Terry and Multi-Objective Reward Modeling Are Complementary (2025-07-10)