Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti

Published: 2020-01-22

Tasks: Zero-Shot Cross-Modal Retrieval · Image-Text Matching · Text Matching · Text Retrieval · Masked Language Modeling · Retrieval · Language Modelling · Image Retrieval

Abstract

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image-Text Matching (ITM). To further enhance the pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from the Web. We first pre-train the model on this dataset, then conduct a second-stage pre-training on Conceptual Captions and SBU Captions. Our experiments show that the multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks, and achieve new state-of-the-art results on both the MSCOCO and Flickr30k datasets.
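The abstract lists Image-Text Matching (ITM) among the four joint pre-training objectives. A minimal sketch of how such an objective is commonly formulated (the function names and the binary cross-entropy formulation are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def itm_logits(pooled, w, b):
    """Matched/mismatched logit from the pooled joint embedding
    (e.g. a [CLS]-style token summarizing both modalities)."""
    return pooled @ w + b

def itm_loss(logits, labels):
    """Binary cross-entropy over matched (1) vs. mismatched (0)
    image-text pairs."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
```

In practice the mismatched pairs are typically constructed by pairing an image with a caption sampled from a different example, and the ITM loss is summed with the three masked-prediction losses during pre-training.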

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 70.7 | ImageBERT |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 90.2 | ImageBERT |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 94.0 | ImageBERT |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 54.3 | ImageBERT |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 79.6 | ImageBERT |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 87.5 | ImageBERT |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 44.0 | ImageBERT |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 71.2 | ImageBERT |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 80.4 | ImageBERT |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 32.3 | ImageBERT |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 59.0 | ImageBERT |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 70.2 | ImageBERT |
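The metrics above are Recall@K (R@1/5/10): the percentage of queries whose ground-truth match appears among the top K retrieved candidates. A minimal sketch of the computation (the function name and the identity-aligned ground truth, i.e. query i matches candidate i, are assumptions for illustration):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] is the similarity between query i and candidate j;
    ground truth pairs queries and candidates by index.
    Returns Recall@k as a percentage."""
    ranks = np.argsort(-sim, axis=1)  # candidate indices, best first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return 100.0 * hits.mean()
```

For example, a similarity matrix where every query scores its own candidate highest gives R@1 = 100.0, while a query whose true match is ranked second counts toward R@5 and R@10 but not R@1.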

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)