Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

Published: 2023-08-24

Tasks: Spatial Reasoning, Visual Localization, Question Answering, Visual Grounding, MMR total, Chart Question Answering, Referring Expression Segmentation, Natural Language Visual Grounding, Image Captioning, Visual Question Answering (VQA), FS-MEVQA, Language Modelling

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
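Since the code and checkpoints are public, here is a minimal inference sketch following the Hugging Face transformers interface documented in the linked repository; the image path and prompt are placeholders, and the from_list_format and chat helpers are custom methods shipped with the released modeling code (loaded via trust_remote_code), so exact signatures may drift across releases.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the instruction-tuned Qwen-VL-Chat checkpoint. trust_remote_code pulls in
# the repository's custom tokenizer/model classes, which add the helpers used below.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image with a text query; "demo.jpeg" is a placeholder path.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},
    {"text": "Describe the image, then give a bounding box for the dog."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# Grounded outputs mark regions as <ref>...</ref><box>(x1,y1),(x2,y2)</box>,
# with coordinates normalized to a 0-1000 grid.
```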

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.9024 | Qwen-VL-Plus |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.651 | Qwen-VL |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.626 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 44.39 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 30.42 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 37.55 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 37.39 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 45.3 | Qwen-VL-Chat (Coordinates) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 39.2 | Qwen-VL-Chat (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 41.7 | Qwen-VL-Chat (Visual Prompt) |
| Visual Question Answering (VQA) | EmbSpatial-Bench | Generation | 49.11 | Qwen-VL-Max |
| Natural Language Visual Grounding | ScreenSpot | Accuracy (%) | 5.2 | Qwen-VL |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 66.3 | Qwen-VL-Chat |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 65.7 | Qwen-VL |
| MMR total | MRR-Benchmark | Total Column Score | 366 | Qwen-VL-Max |
| MMR total | MRR-Benchmark | Total Column Score | 310 | Qwen-VL-Plus |
| Explanatory Visual Question Answering | SME | #Learning Samples (N) | 16 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | ACC | 40.33 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | BLEU-4 | 24.3 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | CIDEr | 201.47 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | Detection | 1.05 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | METEOR | 23.4 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | ROUGE-L | 34.52 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | SPICE | 26.13 | Qwen-VL-Max |
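
The DocVQA rows above report ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-miss answers: each prediction scores 1 minus its normalized edit distance to the closest accepted answer, zeroed once that distance reaches a 0.5 threshold, and the dataset score averages over questions. A minimal sketch of the per-question score (function names are illustrative, not from an official evaluator):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb))) # substitute
        prev = curr
    return prev[-1]

def anls_per_question(prediction: str, answers: list[str], tau: float = 0.5) -> float:
    """Score one prediction against all accepted answers for a question."""
    best = 0.0
    for ans in answers:
        denom = max(len(prediction), len(ans), 1)
        nl = levenshtein(prediction.lower().strip(), ans.lower().strip()) / denom
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best

# Dataset-level ANLS is the mean of per-question scores; 0.9024 above means
# Qwen-VL-Plus's answers average roughly 0.90 similarity to the references.
```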

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)