Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Furu Wei

Published: 2021-11-03
Tasks: Image-Text Retrieval, Text Retrieval, Visual Reasoning, Retrieval, Visual Question Answering (VQA), Image Retrieval
Links: Paper · PDF · Code (official) · Code

Abstract

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at https://aka.ms/vlmo.
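The core architectural idea in the abstract — a Transformer block with one shared self-attention layer feeding a pool of modality-specific feed-forward "experts" — can be illustrated with a minimal sketch. This is a simplified, single-head NumPy illustration under assumed shapes and expert names (`"vision"`, `"language"`, `"vl"`), not the paper's implementation; layer normalization, multi-head attention, and training details are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head self-attention; in a MoME block this layer is
    # shared across all modalities.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def mome_block(x, modality, shared_attn, experts):
    # x: (seq_len, d) token features; modality selects the expert.
    # Shared attention first, then the modality-specific FFN expert.
    h = x + self_attention(x, *shared_attn)       # residual around shared attention
    w1, b1, w2, b2 = experts[modality]            # pick this modality's expert
    ffn = np.maximum(0.0, h @ w1 + b1) @ w2 + b2  # ReLU feed-forward expert
    return h + ffn                                # residual around the expert
```

Because only the feed-forward expert is swapped per modality, the same pretrained block can process image tokens, text tokens, or fused image-text tokens — which is what lets VLMo act as either a dual encoder or a fusion encoder downstream.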

Results

| Task                             | Dataset          | Metric          | Value  | Model |
|----------------------------------|------------------|-----------------|--------|-------|
| Visual Question Answering (VQA)  | VQA v2 test-dev  | Accuracy        | 82.78  | VLMo  |
| Visual Question Answering (VQA)  | VQA v2 test-std  | number          | 67.26  | VLMo  |
| Visual Question Answering (VQA)  | VQA v2 test-std  | other           | 72.87  | VLMo  |
| Visual Question Answering (VQA)  | VQA v2 test-std  | overall         | 81.30  | VLMo  |
| Visual Question Answering (VQA)  | VQA v2 test-std  | yes/no          | 94.68  | VLMo  |
| Visual Reasoning                 | NLVR2 Dev        | Accuracy        | 85.64  | VLMo  |
| Visual Reasoning                 | NLVR2 Test       | Accuracy        | 86.86  | VLMo  |
| Image Retrieval                  | PhotoChat        | R@1             | 11.5   | VLMo  |
| Image Retrieval                  | PhotoChat        | R@5             | 30.0   | VLMo  |
| Image Retrieval                  | PhotoChat        | R@10            | 39.4   | VLMo  |
| Image Retrieval                  | PhotoChat        | Sum(R@1,5,10)   | 83.2   | VLMo  |
| Retrieval                        | Image-Chat       | R@1             | 46.8   | VLMo  |
| Retrieval                        | Image-Chat       | R@5             | 67.5   | VLMo  |
| Retrieval                        | Image-Chat       | Sum(R@1,5)      | 114.3  | VLMo  |
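The retrieval rows report recall@k and its sum over several cutoffs. As a quick reference, here is a minimal sketch of how these metrics are conventionally computed from per-query rankings; function names and the input format are illustrative, not taken from any benchmark's official evaluation code.

```python
def recall_at_k(ranked_ids, gold_id, k):
    # 1.0 if the gold item appears among the top-k retrieved ids, else 0.0.
    return float(gold_id in ranked_ids[:k])

def recall_sum(all_rankings, ks=(1, 5, 10)):
    # all_rankings: list of (ranked candidate ids, gold id) pairs, one per query.
    # Returns Sum(R@k) over the cutoffs in ks, with each R@k averaged over
    # queries and expressed in percent (so the maximum is 100 * len(ks)).
    n = len(all_rankings)
    return sum(100.0 * sum(recall_at_k(r, g, k) for r, g in all_rankings) / n
               for k in ks)
```

Under this convention, a Sum(R@1,5,10) of 83.2 on PhotoChat means the three recall percentages add to 83.2 out of a possible 300.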

Related Papers

- LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)