Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin

Published: 2024-09-18

Tasks: Zero-Shot Video Question Answering · Natural Language Visual Grounding · Video Question Answering · Visual Question Answering (VQA) · Temporal Relation Extraction

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
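The abstract's key idea, Naive Dynamic Resolution, is that the number of visual tokens grows with the input resolution instead of being fixed. The Qwen2-VL report describes a ViT with 14×14 patches whose outputs are compressed by merging 2×2 adjacent tokens before they reach the language model. As a rough illustration (not the official preprocessing code — the real pipeline also rescales images to stay within token budgets), the token count for a given resolution can be sketched like this:

```python
import math

PATCH = 14  # ViT patch size used by Qwen2-VL, per the technical report
MERGE = 2   # 2x2 adjacent visual tokens are merged into one before the LLM


def visual_token_count(height: int, width: int) -> int:
    """Sketch of how many visual tokens an image of a given size yields.

    Each side is divided into 14-pixel patches (rounded up), the patch
    grid is padded to a multiple of the 2x2 merge factor, and the merged
    grid gives the final token count.
    """
    grid_h = math.ceil(height / PATCH)
    grid_w = math.ceil(width / PATCH)
    # Pad the grid so it divides evenly by the merge factor.
    grid_h = math.ceil(grid_h / MERGE) * MERGE
    grid_w = math.ceil(grid_w / MERGE) * MERGE
    return (grid_h * grid_w) // (MERGE * MERGE)


print(visual_token_count(224, 224))  # 16x16 patch grid -> 64 merged tokens
print(visual_token_count(448, 224))  # 32x16 patch grid -> 128 merged tokens
```

A 224×224 image thus costs 64 visual tokens while a 448×224 image costs 128, which is the "different numbers of visual tokens per resolution" behavior the abstract describes. M-RoPE complements this by splitting rotary position embeddings into temporal, height, and width components so the same positional scheme covers text, images, and video frames.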

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Temporal Relation Extraction | Vinoground | Group Score | 17.4 | Qwen2-VL-72B
Temporal Relation Extraction | Vinoground | Text Score | 50.4 | Qwen2-VL-72B
Temporal Relation Extraction | Vinoground | Video Score | 32.6 | Qwen2-VL-72B
Temporal Relation Extraction | Vinoground | Group Score | 15.2 | Qwen2-VL-7B
Temporal Relation Extraction | Vinoground | Text Score | 40.2 | Qwen2-VL-7B
Temporal Relation Extraction | Vinoground | Video Score | 32.4 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | Average Score (9 subtasks) | 42.37 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 27.8 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 19.18 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 45.99 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 68.06 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 35 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 16.25 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 58.59 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 61.5 | Qwen2-VL-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 49 | Qwen2-VL-7B
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 74 | Qwen2-VL-72B
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 62 | Qwen2-VL-7B
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 49.5 | Qwen2-VL-2B
Video Question Answering | VNBench | Accuracy | 33.9 | Qwen2-VL-7B
Video Question Answering | OVBench | AVG | 49.7 | Qwen2-VL-7B
Video Question Answering | TVBench | Average Accuracy | 52.7 | Qwen2-VL-72B
Video Question Answering | TVBench | Average Accuracy | 43.8 | Qwen2-VL-7B
Video Question Answering | NExT-QA | Accuracy | 81.2 | Qwen2-VL-7B
Natural Language Visual Grounding | ScreenSpot | Accuracy (%) | 42.1 | Qwen2-VL-7B

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling (2025-07-08)