Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

2019-09-25 · ECCV 2020

Tasks: Question Answering, Zero-Shot Cross-Modal Retrieval, Image-Text Retrieval, Referring Expression, Representation Learning, Image-Text Matching, Text Matching, Text Retrieval, Masked Language Modeling, Visual Entailment, Referring Expression Comprehension, Visual Reasoning, Retrieval, Visual Question Answering (VQA), Visual Commonsense Reasoning, Language Modelling, Visual Question Answering

Abstract

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.
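The abstract's key pre-training idea, conditional masking, means that at most one modality is masked per example, so masked language modeling always sees the full image and masked region modeling always sees the full text. A minimal sketch of that sampling logic (a hypothetical helper, not the authors' implementation; `conditional_mask`, the 50/50 modality choice, and the `None` region placeholder are illustrative assumptions):

```python
import random

def conditional_mask(tokens, regions, mask_prob=0.15, mask_token="[MASK]"):
    """Sketch of UNITER-style conditional masking: corrupt only ONE
    modality per example, so the model conditions on the other modality
    in full. Returns (masked_tokens, masked_regions, which_modality)."""
    if random.random() < 0.5:
        # Masked Language Modeling: mask some words, keep all image regions.
        masked = [mask_token if random.random() < mask_prob else t
                  for t in tokens]
        return masked, regions, "text"
    # Masked Region Modeling: drop some region features, keep all words.
    # (None stands in for a zeroed-out region feature vector.)
    masked = [None if random.random() < mask_prob else r for r in regions]
    return tokens, masked, "image"
```

This contrasts with joint random masking, where both modalities could be corrupted in the same example; here the unmasked modality is always fully observed, which the paper's ablations credit with better pre-training.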

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | VCR (Q-AR) test | Accuracy | 62.8 | UNITER (Large)
Visual Question Answering (VQA) | VCR (QA-R) test | Accuracy | 83.4 | UNITER-large (ensemble of 10 models)
Visual Question Answering (VQA) | VCR (QA-R) test | Accuracy | 80.8 | UNITER (Large)
Visual Question Answering (VQA) | VCR (Q-A) test | Accuracy | 79.8 | UNITER-large (ensemble of 10 models)
Visual Question Answering (VQA) | VCR (Q-A) test | Accuracy | 77.3 | UNITER (Large)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 73.24 | UNITER (Large)
Visual Question Answering (VQA) | VQA v2 test-std | Overall | 73.4 | UNITER (Large)
Visual Reasoning | NLVR2 Test | Accuracy | 79.5 | UNITER (Large)
Natural Language Inference | SNLI-VE val | Accuracy | 78.98 | UNITER
Natural Language Inference | SNLI-VE test | Accuracy | 78.98 | UNITER (Large)
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 80.7 | UNITER
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 95.7 | UNITER
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 66.2 | UNITER
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 88.4 | UNITER
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 92.9 | UNITER

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)