Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

2024-07-03 · Image Comprehension · Video Question Answering · Video Understanding · Temporal Relation Extraction · Language Modelling · Visual Question Answering

Paper · PDF · Code (official)

Abstract

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
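The 24K-to-96K context extension in the abstract rests on RoPE extrapolation. As an illustration only (the paper does not spell out its exact recipe here), linear position interpolation is one common form: positions are divided by a scale factor so that a sequence 4x longer than the training length still maps into the rotary-angle range the model saw during training. The function below is a hypothetical minimal sketch, not the IXC-2.5 implementation.

```python
import numpy as np

def rope_angles(positions, dim=64, base=10000.0, scale=1.0):
    """Rotary position embedding angles for the given token positions.

    `scale` > 1 compresses positions (linear interpolation): a model
    trained on 24K tokens can address 96K positions with scale=4 while
    keeping every rotation angle inside the trained range.
    Illustrative sketch only -- not the exact IXC-2.5 recipe.
    """
    # Standard RoPE inverse frequencies, one per rotated coordinate pair.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Interpolation: rescale positions before taking the outer product.
    pos = np.asarray(positions, dtype=np.float64) / scale
    return np.outer(pos, inv_freq)  # shape (len(positions), dim // 2)

# Position 96000 under scale=4 yields the same angles as position 24000
# did at training time, so the extended context stays in-distribution.
trained = rope_angles([24000], scale=1.0)
extended = rope_angles([96000], scale=4.0)
assert np.allclose(trained, extended)
```

The design point is that extrapolating raw positions past the training length pushes rotary angles into a regime the model never saw, whereas rescaling keeps them in-distribution at the cost of finer-grained position resolution.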

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Temporal Relation Extraction | Vinoground | Group Score | 9.6 | InternLM-XC-2.5 |
| Temporal Relation Extraction | Vinoground | Text Score | 28.8 | InternLM-XC-2.5 |
| Temporal Relation Extraction | Vinoground | Video Score | 27.8 | InternLM-XC-2.5 |
| Temporal Relation Extraction | Vinoground | Group Score | 9 | InternLM-XC-2.5 (CoT) |
| Temporal Relation Extraction | Vinoground | Text Score | 30.8 | InternLM-XC-2.5 (CoT) |
| Temporal Relation Extraction | Vinoground | Video Score | 28.4 | InternLM-XC-2.5 (CoT) |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 51.7 | IXC-2.5-7B |
| Video Question Answering | TVBench | Average Accuracy | 51.6 | IXC-2.5 7B |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)