
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

2023-11-07 · CVPR 2024
Tasks: Long-Context Understanding · Large Language Model · Visual Question Answering (VQA) · 1 Image, 2*2 Stitching · Language Modelling · Visual Question Answering
Links: Paper · PDF · Code (official)

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
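To make the described design more concrete, below is a minimal PyTorch-style sketch of a modality-adaptive attention layer: modality-specific layer normalization and key/value projections sit alongside shared query and output projections, so text and visual tokens are processed in one sequence while modality-specific features are preserved. This is an illustrative assumption about the module's split of shared vs. modality-specific parts, not the official mPLUG-Owl2 implementation; all names and shapes are hypothetical.

```python
# Illustrative sketch of a modality-adaptive attention block (not the official code).
import torch
import torch.nn as nn


class ModalityAdaptiveAttention(nn.Module):
    """Self-attention in which layer normalization and the key/value projections are
    selected per modality (text vs. visual tokens), while the query and output
    projections are shared across modalities."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Shared modules.
        self.q_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Modality-specific modules: index 0 = text tokens, index 1 = visual tokens.
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.kv_projs = nn.ModuleList([nn.Linear(dim, 2 * dim) for _ in range(2)])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); modality_ids: (batch, seq_len), 0 = text, 1 = image.
        b, n, d = x.shape
        is_img = (modality_ids == 1).unsqueeze(-1)                       # (b, n, 1)
        # Modality-specific normalization and key/value projections.
        normed = torch.where(is_img, self.norms[1](x), self.norms[0](x))
        kv = torch.where(is_img, self.kv_projs[1](normed), self.kv_projs[0](normed))
        k, v = kv.chunk(2, dim=-1)
        q = self.q_proj(normed)                                          # shared query projection

        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)                                        # shared output projection


# Tiny usage example: a mixed sequence of 6 text tokens followed by 10 visual tokens.
layer = ModalityAdaptiveAttention(dim=64, num_heads=8)
tokens = torch.randn(1, 16, 64)
modality_ids = torch.tensor([[0] * 6 + [1] * 10])
print(layer(tokens, modality_ids).shape)  # torch.Size([1, 16, 64])
```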

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 20.6 | mPLUG-Owl2
Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 7.64 | mPLUG-Owl2
Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 23.43 | mPLUG-Owl2
Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 20.05 | mPLUG-Owl2
Long-Context Understanding | MMNeedle | 1 Image, 2*2 Stitching, Exact Accuracy | 1.9 | mPLUG-Owl-v2
Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 0.3 | mPLUG-Owl-v2
Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 0.7 | mPLUG-Owl-v2
Long-Context Understanding | MMNeedle | 10 Images, 1*1 Stitching, Exact Accuracy | 0.4 | mPLUG-Owl-v2
Long-Context Understanding | MMNeedle | 10 Images, 2*2 Stitching, Exact Accuracy | 0.1 | mPLUG-Owl-v2
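The MMNeedle values above are percentages over the benchmark's stitching settings (e.g. a 2*2 stitching packs four sub-images into one input image). As a rough illustration, here is a sketch of how an exact-match location score like "Exact Accuracy" could be computed, assuming a prediction counts only when the predicted image index and sub-image position both equal the ground truth; the Location layout and function name are hypothetical, not the official evaluation script.

```python
# Hypothetical scoring sketch for an exact-match location metric (not the official MMNeedle code).
from typing import List, Tuple

# (image index, row in stitched grid, column in stitched grid) -- assumed layout.
Location = Tuple[int, int, int]


def exact_accuracy(predictions: List[Location], targets: List[Location]) -> float:
    """Percentage of samples whose predicted location equals the target exactly."""
    assert len(predictions) == len(targets)
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return 100.0 * correct / len(targets) if targets else 0.0


# Example: 1 exact match out of 3 samples -> 33.3.
print(exact_accuracy([(0, 1, 1), (0, 0, 0), (2, 3, 1)],
                     [(0, 1, 1), (1, 0, 0), (2, 1, 3)]))
```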

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)