Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity via (i) a meticulously designed visual receptor, (ii) an input-output interface, (iii) a three-stage training pipeline, and (iv) a multilingual, cleaned multimodal corpus. Beyond conventional image description and question answering, we equip the Qwen-VL models with grounding and text-reading abilities by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records among generalist models of similar scale on a broad range of vision-centric benchmarks (e.g., image captioning, question answering, visual grounding) and under different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
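As a usage reference, the sketch below follows the repository's documented Hugging Face interface. The checkpoint name `Qwen/Qwen-VL-Chat` refers to the released model; the image path and prompt are placeholders, and the `from_list_format`/`chat` helpers come from the checkpoint's custom code (loaded with `trust_remote_code=True`), so treat this as a minimal sketch rather than a complete recipe.

```python
# Minimal inference sketch for Qwen-VL-Chat via Hugging Face Transformers.
# Assumes the Qwen/Qwen-VL-Chat checkpoint and its custom helpers
# (from_list_format, chat) loaded with trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image with a text query; the image reference is a placeholder
# (a local path or URL).
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},
    {"text": "Generate the caption in English with grounding:"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# Grounded output embeds boxes as <ref>phrase</ref><box>(x1,y1),(x2,y2)</box>,
# with coordinates normalized to a 0-1000 range.
```

The table below collects reported benchmark results for the Qwen-VL family.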
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.9024 | Qwen-VL-Plus |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.651 | Qwen-VL |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.626 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 44.39 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 30.42 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 37.55 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 37.39 | Qwen-VL-Chat |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 45.3 | Qwen-VL-Chat (Coordinates) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 39.2 | Qwen-VL-Chat (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 41.7 | Qwen-VL-Chat (Visual Prompt) |
| Visual Question Answering (VQA) | EmbSpatial-Bench | Generation | 49.11 | Qwen-VL-Max |
| Natural Language Visual Grounding | ScreenSpot | Accuracy (%) | 5.2 | Qwen-VL |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 66.3 | Qwen-VL-Chat |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 65.7 | Qwen-VL |
| MMR total | MMR-Benchmark | Total Column Score | 366 | Qwen-VL-Max |
| MMR total | MMR-Benchmark | Total Column Score | 310 | Qwen-VL-Plus |
| Explanatory Visual Question Answering | SME | #Learning Samples (N) | 16 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | ACC | 40.33 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | BLEU-4 | 24.3 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | CIDEr | 201.47 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | Detection | 1.05 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | METEOR | 23.4 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | ROUGE-L | 34.52 | Qwen-VL-Max |
| Explanatory Visual Question Answering | SME | SPICE | 26.13 | Qwen-VL-Max |
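The DocVQA rows above report ANLS (Average Normalized Levenshtein Similarity). For readers unfamiliar with the metric, the following is a minimal sketch of the standard ANLS definition, not code from the Qwen-VL repository; the function names and toy example are illustrative.

```python
# Illustrative ANLS computation (the DocVQA metric); standard definition,
# not code from the Qwen-VL repository.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(predictions, references, tau=0.5):
    """Average Normalized Levenshtein Similarity over questions.

    predictions: one answer string per question.
    references: a list of accepted ground-truth strings per question.
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        scores = []
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            # Per the ANLS definition, matches with normalized edit
            # distance >= tau contribute a score of 0.
            scores.append(1.0 - nl if nl < tau else 0.0)
        total += max(scores) if scores else 0.0
    return total / len(predictions)

# Toy usage (illustrative values only):
print(anls(["invoice 42"], [["Invoice 42", "inv. 42"]]))  # -> 1.0
```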