Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks that efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models of a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long-visual-sequence evaluation, Distractor Resistance, to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
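To make the fusion idea concrete, the sketch below illustrates the general pattern of injecting visual features into the language stream via a gated cross-attention branch that runs alongside self-attention, so the text sequence length does not grow with the number of images. This is only a minimal single-head NumPy illustration of that pattern; the function and parameter names are hypothetical, and the actual hyper attention block in the paper differs in its details (e.g. projection sharing and gating design).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, context, Wq, Wk, Wv):
    """Single-head scaled dot-product attention of `queries` over `context`."""
    Q = queries @ Wq                      # (T, d)
    K = context @ Wk                      # (S, d)
    V = context @ Wv                      # (S, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V            # (T, d)

def fused_block_sketch(text_h, img_h, params, gate_logit):
    """Illustrative fusion step (NOT the paper's exact hyper attention):
    text tokens run self-attention as usual, and a sigmoid-gated
    cross-attention branch mixes in image features without appending
    image tokens to the text sequence."""
    self_out = attention(text_h, text_h, *params)   # text attends to text
    cross_out = attention(text_h, img_h, *params)   # text attends to images
    g = 1.0 / (1.0 + np.exp(-gate_logit))           # gate in (0, 1)
    return text_h + self_out + g * cross_out        # residual combination
```

Because the image features enter only through the cross-attention keys and values, the cost of the text stream stays linear in the number of images rather than in the number of image tokens concatenated into the prompt, which is what makes long interleaved image-text inputs tractable.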
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VLM2-Bench | Average Score (9 subtasks) | 37.85 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 17.37 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 18.26 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 62.97 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 49.17 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 31 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 13.5 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 58.86 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 63.5 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 26 | mPLUG-Owl3-7B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 40.1 | mPLUG-Owl3 |
| Video Question Answering | TVBench | Average Accuracy | 42.2 | mPLUG-Owl3 |
| Video Question Answering | NExT-QA | Accuracy | 78.6 | mPLUG-Owl3-8B |
| Video Question Answering | MVBench | Average Accuracy | 59.5 | mPLUG-Owl3-7B |