Unifying Vision-and-Language Tasks via Text Generation

Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

Date: 2021-02-04
Tasks: Question Answering, Text Generation, Referring Expression, Referring Expression Comprehension, Image Captioning, Multi-Task Learning, Visual Question Answering (VQA), Visual Commonsense Reasoning, Conditional Text Generation, Language Modelling, Visual Question Answering

Abstract

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving similar performance to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
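
The idea of generating labels as text can be illustrated with a small sketch of how examples from different tasks might be cast into a shared (source text, target text) format. This is not the authors' code: the task prefixes, the `Example` class, the `region_token` helper, and the `<region_N>` token format are hypothetical names chosen for illustration, and the visual features that the model would also receive are omitted.

```python
from dataclasses import dataclass


@dataclass
class Example:
    source: str  # text given to the encoder (visual features omitted here)
    target: str  # text the decoder is trained to generate


def region_token(index: int) -> str:
    """Hypothetical text token that stands for the index-th image region."""
    return f"<region_{index}>"


def to_text_generation(task: str, text: str, label) -> Example:
    """Cast a task-specific example into a (source, target) text pair.

    The prefixes below are assumptions made for this sketch; the point is
    only that every task shares one output space: text.
    """
    if task == "vqa":
        # The answer is generated as text instead of chosen by a classifier.
        return Example(f"vqa: {text}", str(label))
    if task == "refexp":
        # The referred region is generated as a token instead of scored.
        return Example(f"visual grounding: {text}", region_token(int(label)))
    if task == "caption":
        # Captioning has no input text beyond the task prefix.
        return Example("caption:", str(label))
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    print(to_text_generation("vqa", "what color is the car?", "red"))
    print(to_text_generation("refexp", "the man in a hat", 3))
    print(to_text_generation("caption", "", "a dog runs on the beach"))
```

Because every task reduces to the same pair format, a single encoder-decoder trained with one language modeling loss can in principle handle all of them, which is what makes the multi-task setting with a single set of parameters possible.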

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VCR (Q-AR) test | Accuracy | 58.9 | VL-T5 |
| Visual Question Answering (VQA) | VCR (QA-R) test | Accuracy | 77.8 | VL-T5 |
| Visual Question Answering (VQA) | VCR (Q-A) test | Accuracy | 75.3 | VL-T5 |
| Image Captioning | nocaps val | CIDEr | 4.4 | VL-T5 |
| Image Captioning | nocaps val | SPICE | 5.3 | VL-T5 |
| Image Captioning | Flickr30k Captions test | CIDEr | 2.6 | VL-T5 |
| Image Captioning | Flickr30k Captions test | SPICE | 2 | VL-T5 |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)