

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

2024-08-09 · Tasks: Video Question Answering · Large Language Model · Visual Question Answering (VQA) · Language Modelling · Visual Question Answering
Paper · PDF · Code (official)

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
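To make the architectural idea concrete, below is a minimal, illustrative sketch of a hyper-attention-style block, not the authors' implementation: text self-attention and text-to-image cross-attention share a normalized input and are fused back into the language stream through a learned gate. The module names, dimensions, and the scalar gating scheme are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class HyperAttentionBlock(nn.Module):
    """Illustrative sketch of a hyper-attention-style block (assumed design):
    language self-attention and language-to-vision cross-attention run over a
    shared normalized input and are fused by a learned per-token gate."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # shared norm feeding both attention paths
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # hypothetical scalar fusion gate

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) language hidden states; vision: (B, V, D) visual features
        h = self.norm(text)
        t_out, _ = self.self_attn(h, h, h)             # language-only path
        v_out, _ = self.cross_attn(h, vision, vision)  # language-guided visual path
        g = torch.sigmoid(self.gate(h))                # adaptive fusion weight in [0, 1]
        return text + t_out + g * v_out                # residual fusion of both paths

# Toy usage: a batch of 4 text tokens attending over 16 visual tokens
block = HyperAttentionBlock(dim=64)
out = block(torch.randn(2, 4, 64), torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 4, 64])
```

Because the visual path enters through cross-attention inside the language layer rather than by concatenating image tokens into the sequence, the sequence length the LLM must process stays fixed as more images are added, which is what makes long interleaved image-text inputs tractable.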

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | VLM2-Bench | Average Score (9 subtasks) | 37.85 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 17.37 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 18.26 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 62.97 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 49.17 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 31.00 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 13.50 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 58.86 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 63.50 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 26.00 | mPLUG-Owl3-7B
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 40.1 | mPLUG-Owl3
Video Question Answering | TVBench | Average Accuracy | 42.2 | mPLUG-Owl3
Video Question Answering | NExT-QA | Accuracy | 78.6 | mPLUG-Owl3 (8B)
Video Question Answering | MVBench | Avg. | 59.5 | mPLUG-Owl3 (7B)
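As a sanity check on the table above, the VLM2-Bench "Average Score" is consistent with an unweighted mean of the nine subtask scores; assuming the benchmark averages subtasks uniformly, the reported 37.85 can be reproduced as follows:

```python
# Unweighted mean of the nine VLM2-Bench subtask scores for mPLUG-Owl3-7B
subtasks = {
    "GC-mat": 17.37, "GC-trk": 18.26,
    "OC-cnt": 62.97, "OC-cpr": 49.17, "OC-grp": 31.0,
    "PC-VID": 13.5, "PC-cnt": 58.86, "PC-cpr": 63.5, "PC-grp": 26.0,
}
avg = sum(subtasks.values()) / len(subtasks)
print(f"{avg:.2f}")  # 37.85, matching the reported Average Score
```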

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)