Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut

2021-02-17 | CVPR 2021 | Tasks: Question Answering, Caption Generation, Image Captioning, Visual Question Answering (VQA)


Abstract

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

Results

Task	Dataset	Metric	Value	Model
Image Captioning	nocaps-val-out-domain	CIDEr	94.5	Enc-Dec
Image Captioning	nocaps-val-out-domain	SPICE	11.9	Enc-Dec
Image Captioning	nocaps-val-near-domain	CIDEr	88.3	Enc-Dec
Image Captioning	nocaps-val-near-domain	SPICE	12.1	Enc-Dec
Image Captioning	nocaps-val-overall	CIDEr	90.2	Enc-Dec
Image Captioning	nocaps-val-overall	SPICE	12.1	Enc-Dec
Image Captioning	nocaps-val-in-domain	CIDEr	92.6	Enc-Dec
Image Captioning	nocaps-val-in-domain	SPICE	12.5	Enc-Dec
