Papers With Code

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao

2021-08-24 · ICLR 2022

Tasks: Question Answering · Image Captioning · Visual Question Answering (VQA) · Language Modelling · Visual Question Answering

Abstract

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
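The single prefix language modeling objective described in the abstract can be illustrated with a short sketch of the attention mask it implies. This is an illustrative reconstruction, not code from the paper: the function name and shapes are made up for the example. The key idea is that prefix positions (image patches plus a leading text span) attend to each other bidirectionally, while the remaining positions attend causally, and the training loss is computed only on those suffix tokens.

```python
def prefix_lm_mask(prefix_len: int, seq_len: int) -> list[list[bool]]:
    """Boolean attention mask for prefix language modeling (PrefixLM).

    mask[i][j] is True when query position i may attend to key position j.
    Positions [0, prefix_len) form the prefix and attend to each other
    bidirectionally; the remaining (suffix) positions attend causally,
    as in a standard decoder.
    """
    return [
        [j <= i or (i < prefix_len and j < prefix_len) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# With a 2-token prefix in a 4-token sequence: prefix tokens see each
# other in both directions, while suffix tokens remain causal.
mask = prefix_lm_mask(prefix_len=2, seq_len=4)
assert mask[0][1]        # prefix token 0 attends forward to prefix token 1
assert not mask[1][3]    # prefix token 1 cannot see the suffix
assert mask[3][0]        # suffix token 3 sees the whole prefix
```

Because the suffix is generated left-to-right under this mask, a single cross-entropy loss on the suffix tokens suffices, which is what lets SimVLM drop the multiple dataset-specific objectives used by earlier VLP methods.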

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 80.03 | SimVLM |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 80.34 | SimVLM |
| Visual Reasoning | NLVR2 Dev | Accuracy | 84.53 | SimVLM |
| Visual Reasoning | NLVR2 Test | Accuracy | 85.15 | SimVLM |
| Natural Language Inference | SNLI-VE val | Accuracy | 86.21 | SimVLM |
| Natural Language Inference | SNLI-VE test | Accuracy | 86.32 | SimVLM |
| Image Captioning | nocaps in-domain | BLEU-1 | 84.64 | Single Model |
| Image Captioning | nocaps in-domain | BLEU-2 | 70.00 | Single Model |
| Image Captioning | nocaps in-domain | BLEU-3 | 52.96 | Single Model |
| Image Captioning | nocaps in-domain | BLEU-4 | 34.66 | Single Model |
| Image Captioning | nocaps in-domain | CIDEr | 108.98 | Single Model |
| Image Captioning | nocaps in-domain | METEOR | 31.97 | Single Model |
| Image Captioning | nocaps in-domain | ROUGE-L | 61.01 | Single Model |
| Image Captioning | nocaps in-domain | SPICE | 14.60 | Single Model |
| Image Captioning | nocaps near-domain | BLEU-1 | 84.36 | Single Model |
| Image Captioning | nocaps near-domain | BLEU-2 | 69.83 | Single Model |
| Image Captioning | nocaps near-domain | BLEU-3 | 52.42 | Single Model |
| Image Captioning | nocaps near-domain | BLEU-4 | 33.74 | Single Model |
| Image Captioning | nocaps near-domain | CIDEr | 110.76 | Single Model |
| Image Captioning | nocaps near-domain | METEOR | 30.97 | Single Model |
| Image Captioning | nocaps near-domain | ROUGE-L | 60.46 | Single Model |
| Image Captioning | nocaps near-domain | SPICE | 14.61 | Single Model |
| Image Captioning | nocaps out-of-domain | BLEU-1 | 80.89 | Single Model |
| Image Captioning | nocaps out-of-domain | BLEU-2 | 64.21 | Single Model |
| Image Captioning | nocaps out-of-domain | BLEU-3 | 44.38 | Single Model |
| Image Captioning | nocaps out-of-domain | BLEU-4 | 24.47 | Single Model |
| Image Captioning | nocaps out-of-domain | CIDEr | 109.49 | Single Model |
| Image Captioning | nocaps out-of-domain | METEOR | 27.91 | Single Model |
| Image Captioning | nocaps out-of-domain | ROUGE-L | 56.69 | Single Model |
| Image Captioning | nocaps out-of-domain | SPICE | 13.89 | Single Model |
| Image Captioning | nocaps entire | BLEU-1 | 83.78 | Single Model |
| Image Captioning | nocaps entire | BLEU-2 | 68.86 | Single Model |
| Image Captioning | nocaps entire | BLEU-3 | 51.06 | Single Model |
| Image Captioning | nocaps entire | BLEU-4 | 32.20 | Single Model |
| Image Captioning | nocaps entire | CIDEr | 110.31 | Single Model |
| Image Captioning | nocaps entire | METEOR | 30.55 | Single Model |
| Image Captioning | nocaps entire | ROUGE-L | 59.86 | Single Model |
| Image Captioning | nocaps entire | SPICE | 14.49 | Single Model |
| Image Captioning | nocaps-val-in-domain | CIDEr | 113.7 | SimVLM |
| Image Captioning | nocaps-val-near-domain | CIDEr | 110.9 | SimVLM |
| Image Captioning | nocaps-val-out-domain | CIDEr | 115.2 | SimVLM |
| Image Captioning | nocaps-val-overall | CIDEr | 112.2 | SimVLM |
| Image Captioning | COCO Captions | BLEU-4 | 40.6 | SimVLM |
| Image Captioning | COCO Captions | CIDEr | 143.3 | SimVLM |
| Image Captioning | COCO Captions | METEOR | 33.4 | SimVLM |
| Image Captioning | COCO Captions | SPICE | 25.4 | SimVLM |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)