Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

2024-07-03 · Image Comprehension · Video Question Answering · Video Understanding · Temporal Relation Extraction · Language Modelling · Visual Question Answering

Paper · PDF · Code (official)

Abstract

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
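The 24K-to-96K context extension in the abstract rests on RoPE extrapolation. As an illustration only (the paper does not spell out its exact recipe here), linear position interpolation is one common form: positions are divided by a scale factor so that a sequence 4x longer than the training length still maps into the rotary-angle range the model saw during training. The function below is a hypothetical minimal sketch, not the IXC-2.5 implementation.

```python
import numpy as np

def rope_angles(positions, dim=64, base=10000.0, scale=1.0):
    """Rotary position embedding angles for the given token positions.

    `scale` > 1 compresses positions (linear interpolation): a model
    trained on 24K tokens can address 96K positions with scale=4 while
    keeping every rotation angle inside the trained range.
    Illustrative sketch only -- not the exact IXC-2.5 recipe.
    """
    # Standard RoPE inverse frequencies, one per rotated coordinate pair.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Interpolation: rescale positions before taking the outer product.
    pos = np.asarray(positions, dtype=np.float64) / scale
    return np.outer(pos, inv_freq)  # shape (len(positions), dim // 2)

# Position 96000 under scale=4 yields the same angles as position 24000
# did at training time, so the extended context stays in-distribution.
trained = rope_angles([24000], scale=1.0)
extended = rope_angles([96000], scale=4.0)
assert np.allclose(trained, extended)
```

The design point is that extrapolating raw positions past the training length pushes rotary angles into a regime the model never saw, whereas rescaling keeps them in-distribution at the cost of finer-grained position resolution.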

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Temporal Relation Extraction | Vinoground | Group Score | 9.6 | InternLM-XC-2.5 |
| Temporal Relation Extraction | Vinoground | Text Score | 28.8 | InternLM-XC-2.5 |
| Temporal Relation Extraction | Vinoground | Video Score | 27.8 | InternLM-XC-2.5 |
| Temporal Relation Extraction | Vinoground | Group Score | 9 | InternLM-XC-2.5 (CoT) |
| Temporal Relation Extraction | Vinoground | Text Score | 30.8 | InternLM-XC-2.5 (CoT) |
| Temporal Relation Extraction | Vinoground | Video Score | 28.4 | InternLM-XC-2.5 (CoT) |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 51.7 | IXC-2.5-7B |
| Video Question Answering | TVBench | Average Accuracy | 51.6 | IXC-2.5 7B |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)