Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin

Published: 2024-09-18

Tasks: Zero-Shot Video Question Answering · Natural Language Visual Grounding · Video Question Answering · Visual Question Answering (VQA) · Temporal Relation Extraction

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
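The abstract's key idea, Naive Dynamic Resolution, is that the number of visual tokens grows with the input resolution instead of being fixed. The Qwen2-VL report describes a ViT with 14×14 patches whose outputs are compressed by merging 2×2 adjacent tokens before they reach the language model. As a rough illustration (not the official preprocessing code — the real pipeline also rescales images to stay within token budgets), the token count for a given resolution can be sketched like this:

```python
import math

PATCH = 14  # ViT patch size used by Qwen2-VL, per the technical report
MERGE = 2   # 2x2 adjacent visual tokens are merged into one before the LLM


def visual_token_count(height: int, width: int) -> int:
    """Sketch of how many visual tokens an image of a given size yields.

    Each side is divided into 14-pixel patches (rounded up), the patch
    grid is padded to a multiple of the 2x2 merge factor, and the merged
    grid gives the final token count.
    """
    grid_h = math.ceil(height / PATCH)
    grid_w = math.ceil(width / PATCH)
    # Pad the grid so it divides evenly by the merge factor.
    grid_h = math.ceil(grid_h / MERGE) * MERGE
    grid_w = math.ceil(grid_w / MERGE) * MERGE
    return (grid_h * grid_w) // (MERGE * MERGE)


print(visual_token_count(224, 224))  # 16x16 patch grid -> 64 merged tokens
print(visual_token_count(448, 224))  # 32x16 patch grid -> 128 merged tokens
```

A 224×224 image thus costs 64 visual tokens while a 448×224 image costs 128, which is the "different numbers of visual tokens per resolution" behavior the abstract describes. M-RoPE complements this by splitting rotary position embeddings into temporal, height, and width components so the same positional scheme covers text, images, and video frames.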

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Temporal Relation Extraction | Vinoground | Group Score | 17.4 | Qwen2-VL-72B
Temporal Relation Extraction | Vinoground | Text Score | 50.4 | Qwen2-VL-72B
Temporal Relation Extraction | Vinoground | Video Score | 32.6 | Qwen2-VL-72B
Temporal Relation Extraction | Vinoground | Group Score | 15.2 | Qwen2-VL-7B
Temporal Relation Extraction | Vinoground | Text Score | 40.2 | Qwen2-VL-7B
Temporal Relation Extraction | Vinoground | Video Score | 32.4 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | Average Score (9 subtasks) | 42.37 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 27.8 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 19.18 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 45.99 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 68.06 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 35 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 16.25 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 58.59 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 61.5 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 49 | Qwen2-VL-7B
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 74 | Qwen2-VL-72B
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 62 | Qwen2-VL-7B
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 49.5 | Qwen2-VL-2B
Video Question Answering | VNBench | Accuracy | 33.9 | Qwen2-VL-7B
Video Question Answering | OVBench | AVG | 49.7 | Qwen2-VL-7B
Video Question Answering | TVBench | Average Accuracy | 52.7 | Qwen2-VL-72B
Video Question Answering | TVBench | Average Accuracy | 43.8 | Qwen2-VL-7B
Video Question Answering | NExT-QA | Accuracy | 81.2 | Qwen2-VL-7B
Natural Language Visual Grounding | ScreenSpot | Accuracy (%) | 42.1 | Qwen2-VL-7B

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling (2025-07-08)