Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks that efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models of a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long-visual-sequence evaluation, Distractor Resistance, to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
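To make the fusion idea concrete, the sketch below illustrates the general pattern of injecting visual features into the language stream via a gated cross-attention branch that runs alongside self-attention, so the text sequence length does not grow with the number of images. This is only a minimal single-head NumPy illustration of that pattern; the function and parameter names are hypothetical, and the actual hyper attention block in the paper differs in its details (e.g. projection sharing and gating design).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, context, Wq, Wk, Wv):
    """Single-head scaled dot-product attention of `queries` over `context`."""
    Q = queries @ Wq                      # (T, d)
    K = context @ Wk                      # (S, d)
    V = context @ Wv                      # (S, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V            # (T, d)

def fused_block_sketch(text_h, img_h, params, gate_logit):
    """Illustrative fusion step (NOT the paper's exact hyper attention):
    text tokens run self-attention as usual, and a sigmoid-gated
    cross-attention branch mixes in image features without appending
    image tokens to the text sequence."""
    self_out = attention(text_h, text_h, *params)   # text attends to text
    cross_out = attention(text_h, img_h, *params)   # text attends to images
    g = 1.0 / (1.0 + np.exp(-gate_logit))           # gate in (0, 1)
    return text_h + self_out + g * cross_out        # residual combination
```

Because the image features enter only through the cross-attention keys and values, the cost of the text stream stays linear in the number of images rather than in the number of image tokens concatenated into the prompt, which is what makes long interleaved image-text inputs tractable.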
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VLM2-Bench | Average Score (9 subtasks) | 37.85 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 17.37 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 18.26 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 62.97 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 49.17 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 31 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 13.5 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 58.86 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 63.5 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 26 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 40.1 | mPLUG-Owl3 |
| Video Question Answering | TVBench | Average Accuracy | 42.2 | mPLUG-Owl3 |
| Video Question Answering | NExT-QA | Accuracy | 78.6 | mPLUG-Owl3-8B |
| Video Question Answering | MVBench | Average Accuracy | 59.5 | mPLUG-Owl3-7B |