Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai

Published: 2019-08-22 · ICLR 2020

Tasks: Question Answering, Referring Expression, Image-Text Matching, Referring Expression Comprehension, Visual Question Answering (VQA), Visual Commonsense Reasoning, Language Modelling

Links: Paper · PDF · Code (official) · community implementations

Abstract

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as its backbone, and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure better aligns visual-linguistic clues and benefits downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark. Code is released at \url{https://github.com/jackroos/VL-BERT}.
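The abstract's core design decision is that every input element, whether a word or an image RoI, is embedded the same way before entering the Transformer. The sketch below illustrates that idea in minimal pure Python: each element's embedding is the sum of a content embedding (word vector or visual feature), a segment embedding, and a position embedding. All names, dimensions, and the random stand-in for RoI features are illustrative assumptions, not the paper's implementation; the real model also pools CNN features inside each box and uses hidden sizes of 768 (BASE) or 1024 (LARGE).

```python
import random

random.seed(0)
HIDDEN = 8  # toy hidden size; the paper uses 768 (BASE) or 1024 (LARGE)

def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(HIDDEN)]

# Hypothetical toy embedding tables (learned parameters in the real model).
word_emb = {tok: rand_vec() for tok in range(100)}   # word embeddings
segment_emb = [rand_vec(), rand_vec()]               # 0 = text, 1 = image
pos_emb = [rand_vec() for _ in range(32)]            # sequence positions

def add(*vecs):
    """Element-wise sum of equal-length vectors."""
    return [sum(vals) for vals in zip(*vecs)]

def roi_feature(roi):
    # Stand-in for a Fast(er) R-CNN appearance feature of the region
    # (an assumption here; the real model pools CNN features in the box).
    return rand_vec()

def build_input(token_ids, rois):
    """Each input element is either a word from the sentence or an RoI
    from the image; its embedding is the sum of a content embedding,
    a segment embedding, and a position embedding."""
    elems = []
    for i, tok in enumerate(token_ids):
        elems.append(add(word_emb[tok], segment_emb[0], pos_emb[i]))
    for j, roi in enumerate(rois):
        elems.append(add(roi_feature(roi), segment_emb[1],
                         pos_emb[len(token_ids) + j]))
    return elems  # list of HIDDEN-dim vectors, ready for a Transformer

seq = build_input([5, 17, 42], rois=[(0, 0, 32, 32), (8, 8, 48, 48)])
print(len(seq), len(seq[0]))  # 5 8
```

Because words and RoIs share one embedding interface, a single Transformer stack can attend jointly over both modalities, which is what makes the representation reusable across the downstream tasks listed above.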

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VCR (Q-AR) test | Accuracy | 59.7 | VL-BERT (Large) |
| Visual Question Answering (VQA) | VCR (Q-AR) dev | Accuracy | 58.9 | VL-BERT (Large) |
| Visual Question Answering (VQA) | VCR (Q-AR) dev | Accuracy | 55.2 | VL-BERT (Base) |
| Visual Question Answering (VQA) | VCR (Q-A) dev | Accuracy | 75.5 | VL-BERT (Large) |
| Visual Question Answering (VQA) | VCR (Q-A) dev | Accuracy | 73.8 | VL-BERT (Base) |
| Visual Question Answering (VQA) | VCR (QA-R) dev | Accuracy | 77.9 | VL-BERT (Large) |
| Visual Question Answering (VQA) | VCR (QA-R) dev | Accuracy | 74.4 | VL-BERT (Base) |
| Visual Question Answering (VQA) | VCR (QA-R) test | Accuracy | 78.4 | VL-BERT (Large) |
| Visual Question Answering (VQA) | VCR (Q-A) test | Accuracy | 75.8 | VL-BERT (Large) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 71.79 | VL-BERT (Large) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 71.16 | VL-BERT (Base) |
| Visual Question Answering (VQA) | VQA v2 test-std | Overall | 72.2 | VL-BERT (Large) |
| Image Retrieval with Multi-Modal Query | CommercialAdsDataset | ADD(S) AUC | 86.27 | VL-BERT |
| Cross-Modal Information Retrieval | CommercialAdsDataset | ADD(S) AUC | 86.27 | VL-BERT |
| Cross-Modal Retrieval | CommercialAdsDataset | ADD(S) AUC | 86.27 | VL-BERT |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)