Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Visual Question Answering on MM-Vet

Metric: GPT-4 score (higher is better)


Results

| # | Model | GPT-4 score | Extra Data | Paper | Date | Code |
|---|-------|-------------|------------|-------|------|------|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | 74.24 | No | MMCTAgent: Multi-modal Critical Thinking Agent F... | 2024-05-28 | - |
| 2 | Qwen2-VL-72B | 74 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 3 | InternVL2.5-78B | 72.3 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 4 | GPT-4o +text rationale +IoT | 72.2 | No | Image-of-Thought Prompting for Visual Reasoning ... | 2024-05-22 | - |
| 5 | Lyra-Pro | 71.4 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 6 | GLM-4V-Plus | 71.1 | No | CogVLM2: Visual Language Models for Image and Vi... | 2024-08-29 | Code |
| 7 | Phantom-7B | 70.8 | No | Phantom of Latent for Large Language and Vision ... | 2024-09-23 | Code |
| 8 | InternVL2.5-38B | 68.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 9 | InternVL2-26B (SGP, token ratio 64%) | 65.6 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 10 | Baichuan-Omni (7B) | 65.4 | No | Baichuan-Omni Technical Report | 2024-10-11 | Code |
| 11 | InternVL2.5-26B | 65 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 12 | Qwen2-VL-7B (finetuned on GAP-VQA train) | 64.954 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 13 | InternVL2-Llama3-76B | 64.4 | No | - | - | - |
| 14 | GLM4 Vision | 63.9 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 15 | LLaVA-OneVision-72B | 63.7 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 16 | Lyra-Base | 63.5 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 17 | InternVL2-26B (SGP, token ratio 35%) | 63.2 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 18 | InternVL 1.5 | 62.8 | No | How Far Are We to GPT-4V? Closing the Gap to Com... | 2024-04-25 | Code |
| 19 | InternVL2.5-8B | 62.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 20 | MAmmoTH-VL-8B | 62.3 | No | MAmmoTH-VL: Eliciting Multimodal Reasoning with ... | 2024-12-06 | Code |
| 21 | Qwen2-VL-7B | 62 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 22 | InternVL2-40B | 61.8 | No | - | - | - |
| 23 | Mini-Gemini-HD-BS | 60.8 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 24 | InternVL2.5-2B | 60.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 25 | MAmmoTH-VL-8B (SI) | 60.6 | No | MAmmoTH-VL: Eliciting Multimodal Reasoning with ... | 2024-12-06 | Code |
| 26 | InternVL2.5-4B | 60.6 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 27 | Mini-Gemini-HD | 59.3 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 28 | GLM-4V-9B | 58 | No | CogVLM2: Visual Language Models for Image and Vi... | 2024-08-29 | Code |
| 29 | LLaVA-OneVision-7B | 57.5 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 30 | LLaVA-NeXT-34B | 57.4 | No | - | - | - |
| 31 | Meteor | 57.3 | No | Meteor: Mamba-based Traversal of Rationale for L... | 2024-05-24 | Code |
| 32 | CROME (Vicuna-13B) | 55.1 | No | CROME: Cross-Modal Adapters for Efficient Multim... | 2024-08-13 | - |
| 33 | IXC2-4KHD | 54.9 | No | InternLM-XComposer2-4KHD: A Pioneering Large Vis... | 2024-04-09 | Code |
| 34 | Weitu-VL-1.0 | 54.7 | No | - | - | - |
| 35 | TroL-7B | 54.7 | No | TroL: Traversal of Layers for Large Language and... | 2024-06-18 | Code |
| 36 | Mini-Gemini | 53 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 37 | CogVLM (Vicuna-7B) | 52.8 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 38 | CogAgent | 52.8 | No | CogAgent: A Visual Language Model for GUI Agents | 2023-12-14 | Code |
| 39 | Qwen2-VL-2B (finetuned on GAP-VQA train) | 52.43 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 40 | InternVL2-26B (SGP, token ratio 9%) | 52.1 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 41 | MM1.5-30B | 52 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 42 | MiniCPM-Llama3-V-2.5-8B (finetuned on GAP-VQA train) | 51.789 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 43 | IXC-2.5-7B | 51.7 | No | InternLM-XComposer-2.5: A Versatile Large Vision... | 2024-07-03 | Code |
| 44 | InternLM-XComposer2 | 51.2 | No | InternLM-XComposer2: Mastering Free-form Text-Im... | 2024-01-29 | Code |
| 45 | Lyra-Mini | 51.2 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 46 | CuMo-7B | 51 | No | CuMo: Scaling Multimodal LLM with Co-Upcycled Mi... | 2024-05-09 | Code |
| 47 | TACO (Qwen2-7B / SigLIP) | 50.9 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 48 | Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback)) | 50.7 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 49 | POINTS-9B | 50 | No | POINTS: Improving Your Vision-language Model wit... | 2024-09-07 | - |
| 50 | VILA^2-8B | 50 | No | VILA$^2$: VILA Augmented VILA | 2024-07-24 | - |
| 51 | Janus-Pro-7B | 50 | No | Janus-Pro: Unified Multimodal Understanding and ... | 2025-01-29 | Code |
| 52 | Silkie | 49.9 | No | Silkie: Preference Distillation for Large Visual... | 2023-12-17 | - |
| 53 | Silkie (Qwen-VL-Chat + DPO w/ VLFeedback) | 49.9 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 54 | Qwen2-VL-2B | 49.5 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 55 | FlashSloth-HD | 49 | No | FlashSloth: Lightning Multimodal Large Language ... | 2024-12-05 | Code |
| 56 | InternVL 1.2 | 48.9 | No | How Far Are We to GPT-4V? Closing the Gap to Com... | 2024-04-25 | Code |
| 57 | SEA-PRIME (Vicuna-13B) | 48.8 | No | SEA: Supervised Embedding Alignment for Token-Le... | 2024-08-21 | - |
| 58 | InternVL2.5-1B | 48.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 59 | MM1-30B-Chat | 48.7 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 60 | SETOKIM (13B) | 48.7 | No | Towards Semantic Equivalence of Tokenization in ... | 2024-06-07 | - |
| 61 | Emu2-Chat | 48.5 | No | Generative Multimodal Models are In-Context Lear... | 2023-12-20 | Code |
| 62 | MG-LLaVA (34B) | 48.5 | No | MG-LLaVA: Towards Multi-Granularity Visual Instr... | 2024-06-25 | Code |
| 63 | SPHINX-Plus | 47.9 | No | SPHINX-X: Scaling Data and Parameters for a Fami... | 2024-02-08 | Code |
| 64 | ConvLLaVA | 45.9 | No | ConvLLaVA: Hierarchical Backbones as Visual Enco... | 2024-05-24 | Code |
| 65 | VILA-13B | 45.7 | No | VILA: On Pre-training for Visual Language Models | 2023-12-12 | Code |
| 66 | TACO (LLaMA3-8B / SigLIP) | 45.7 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 67 | HPT 1.5 Edge | 45.3 | No | - | - | - |
| 68 | TACO (LLaMA3-8B / CLIP) | 45.2 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 69 | LLaVA-v1.6 (7B, w/ STIC) | 45 | No | Enhancing Large Vision Language Models with Self... | 2024-05-30 | Code |
| 70 | H2OVL-Mississippi-2B | 44.7 | No | H2OVL-Mississippi Vision Language Models Technic... | 2024-10-17 | - |
| 71 | PIIP-LLaVA (Vicuna-7B, ConvNeXt-L, CLIP-L) | 44.7 | No | Parameter-Inverted Image Pyramid Networks for Vi... | 2025-01-14 | Code |
| 72 | Imp-4B | 44.6 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 73 | LLaVA-Next-Mistral-7b (+ DPO w/ VLFeedback) | 44.2 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 74 | MGM-7B+RP | 44.1 | No | Img-Diff: Contrastive Data Synthesis for Multimo... | 2024-08-08 | - |
| 75 | LLaVA-Next-Vicuna-7b (+ DPO w/ VLFeedback) | 44.1 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 76 | VW-LMM | 44 | No | Multi-modal Auto-regressive Modeling via Visual ... | 2024-03-12 | Code |
| 77 | MoAI | 43.7 | No | MoAI: Mixture of All Intelligence for Large Lang... | 2024-03-12 | Code |
| 78 | MM1-3B-Chat | 43.7 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 79 | MM1.5-3B-MoE | 43.7 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 80 | Imp-3B | 43.3 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 81 | ShareGPT4V-13B | 43.1 | No | ShareGPT4V: Improving Large Multi-Modal Models w... | 2023-11-21 | Code |
| 82 | Mini-Gemini (+MoCa) | 42.9 | No | Deciphering Cross-Modal Alignment in Large Visio... | 2024-10-09 | Code |
| 83 | MM1.5-7B | 42.2 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 84 | MM1-7B-Chat | 42.1 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 85 | FlashSloth | 41.9 | No | FlashSloth: Lightning Multimodal Large Language ... | 2024-12-05 | Code |
| 86 | DeepSeek-VL | 41.5 | No | DeepSeek-VL: Towards Real-World Vision-Language ... | 2024-03-08 | Code |
| 87 | LLaVA1.5-13B-BPO | 41.4 | No | Strengthening Multimodal Large Language Model wi... | 2024-03-13 | - |
| 88 | ASMv2 | 41.3 | No | The All-Seeing Project V2: Towards General Relat... | 2024-02-29 | Code |
| 89 | FocusLLaVA | 41.3 | No | FocusLLaVA: A Coarse-to-Fine Approach for Effici... | 2024-11-21 | - |
| 90 | SeVa-13B | 41 | No | Self-Supervised Visual Preference Alignment | 2024-04-16 | Code |
| 91 | MM1.5-3B | 41 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 92 | LLaVA-1.5-7B (VG-S) | 40.4 | No | ProVision: Programmatically Scaling Vision-centr... | 2024-12-09 | Code |
| 93 | CoLLaVO | 40.3 | No | CoLLaVO: Crayon Large Language and Vision mOdel | 2024-02-17 | Code |
| 94 | SPHINX-2k | 40.2 | No | SPHINX: The Joint Mixing of Weights, Tasks, and ... | 2023-11-13 | Code |
| 95 | LLaVA-1.5 (LVIS-Instruct4V) | 40.2 | No | To See is to Believe: Prompting GPT-4V for Bette... | 2023-11-13 | Code |
| 96 | mPLUG-Owl3 | 40.1 | No | mPLUG-Owl3: Towards Long Image-Sequence Understa... | 2024-08-09 | Code |
| 97 | Mono-InternVL-2B | 40.1 | No | Mono-InternVL: Pushing the Boundaries of Monolit... | 2024-10-10 | - |
| 98 | LLaVA1.5-13B-MDA | 39.9 | No | Looking Beyond Text: Reducing Language bias in L... | 2024-11-21 | - |
| 99 | LLaVA-VT (Vicuna-13B) | 39.8 | No | Beyond Embeddings: The Promise of Visual Table i... | 2024-03-27 | Code |
| 100 | MM1.5-1B-MoE | 39.8 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 101 | Janus-Pro-1B | 39.8 | No | Janus-Pro: Unified Multimodal Understanding and ... | 2025-01-29 | Code |
| 102 | SQ-LLaVA∗ | 39.7 | No | SQ-LLaVA: Self-Questioning for Large Vision-Lang... | 2024-03-17 | Code |
| 103 | OmniFusion (grid split + ruDocVQA) | 39.4 | No | OmniFusion Technical Report | 2024-04-09 | - |
| 104 | DeepStack-L-HD (Vicuna-13B) | 39.3 | No | DeepStack: Deeply Stacking Visual Tokens is Surp... | 2024-06-06 | - |
| 105 | LAF-13B | 38.9 | No | From Training-Free to Adaptive: Empirical Insigh... | 2024-01-31 | - |
| 106 | InfiMM-HD | 38.9 | No | InfiMM-HD: A Leap Forward in High-Resolution Mul... | 2024-03-03 | - |
| 107 | InternLM-XC2 + MMDU-45k | 38.8 | No | MMDU: A Multi-Turn Multi-Image Dialog Understand... | 2024-06-17 | Code |
| 108 | LLaVA-1.5-7B (DC-S) | 38.5 | No | ProVision: Programmatically Scaling Vision-centr... | 2024-12-09 | Code |
| 109 | LayoutLMv3+ConvNeXt+CLIP | 38.4 | No | MouSi: Poly-Visual-Expert Vision-Language Models | 2024-01-30 | Code |
| 110 | VOLCANO 13B | 38 | No | Volcano: Mitigating Multimodal Hallucination thr... | 2023-11-13 | Code |
| 111 | LLaVA-1.5+MMInstruct (Vicuna-13B) | 37.9 | No | MMInstruct: A High-Quality Multi-Modal Instructi... | 2024-07-22 | Code |
| 112 | LLaVA-1.5-13B (+CSR) | 37.8 | No | Calibrated Self-Rewarding Vision Language Models | 2024-05-23 | Code |
| 113 | LLaVA-1.5-LLaMA3-8B | 37.8 | No | What If We Recaption Billions of Web Images with... | 2024-06-12 | - |
| 114 | LLaVA-1.5 + DenseFusion-1M (Vicuna-7B) | 37.8 | No | DenseFusion-1M: Merging Vision Experts for Compr... | 2024-07-11 | Code |
| 115 | ShareGPT4V-7B | 37.6 | No | ShareGPT4V: Improving Large Multi-Modal Models w... | 2023-11-21 | Code |
| 116 | LLaVA-1.5+CoS | 37.6 | No | Chain-of-Spot: Interactive Reasoning Improves La... | 2024-03-19 | Code |
| 117 | LLaVA-COCO-13B | 37.5 | No | COCO is "ALL" You Need for Visual Instruction F... | 2024-01-17 | - |
| 118 | LLaVA-S^2 + DenseFusion-1M (Vicuna-7B) | 37.5 | No | DenseFusion-1M: Merging Vision Experts for Compr... | 2024-07-11 | Code |
| 119 | MM1.5-1B | 37.4 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 120 | Dynamic-LLaVA-13B | 37.3 | No | Dynamic-LLaVA: Efficient Multimodal Large Langua... | 2024-12-01 | Code |
| 121 | SeVa-7B | 37.2 | No | Self-Supervised Visual Preference Alignment | 2024-04-16 | Code |
| 122 | SoM-LLaVA-1.5-T | 37.2 | No | List Items One by One: A New Data Source and Lea... | 2024-04-25 | Code |
| 123 | Emu3 | 37.2 | No | Emu3: Next-Token Prediction is All You Need | 2024-09-27 | Code |
| 124 | LLaVA-Instruct (Vicuna-1.5-13B) | 37.1 | No | MM-Instruct: Generated Visual Instructions for L... | 2024-06-28 | Code |
| 125 | ILLUME | 37 | No | ILLUME: Illuminating Your LLMs to See, Draw, and... | 2024-12-09 | - |
| 126 | LLaVA1.5-7B-BPO | 36.8 | No | Strengthening Multimodal Large Language Model wi... | 2024-03-13 | - |
| 127 | LLaVA-1.5-13B (+ MMFuser) | 36.6 | No | MMFuser: Multimodal Multi-Layer Feature Fuser fo... | 2024-10-15 | Code |
| 128 | CaMML-13B | 36.4 | No | CaMML: Context-Aware Multimodal Learner for Larg... | 2024-01-06 | Code |
| 129 | LLaVA-65B (Data Mixing) | 36.4 | No | An Empirical Study of Scaling Instruct-Tuned Lar... | 2023-09-18 | Code |
| 130 | Vary-base | 36.2 | No | Vary: Scaling up the Vision Vocabulary for Large... | 2023-12-11 | Code |
| 131 | StableLLaVA | 36.1 | No | StableLLaVA: Enhanced Visual Instruction Tuning ... | 2023-08-20 | Code |
| 132 | DreamLLM-7B | 35.9 | No | DreamLLM: Synergistic Multimodal Comprehension a... | 2023-09-20 | Code |
| 133 | MoE-LLaVA-2.7B×4-Top2 | 35.9 | No | MoE-LLaVA: Mixture of Experts for Large Vision-L... | 2024-01-29 | Code |
| 134 | SoM-LLaVA-1.5 | 35.9 | No | List Items One by One: A New Data Source and Lea... | 2024-04-25 | Code |
| 135 | Dragonfly (Llama3-8B) | 35.9 | No | Dragonfly: Multi-Resolution Zoom-In Encoding Enh... | 2024-06-03 | Code |
| 136 | Ferret-v2-13B | 35.7 | No | Ferret-v2: An Improved Baseline for Referring an... | 2024-04-11 | Code |
| 137 | AlignGPT (Vicuna-13B) | 35.6 | No | AlignGPT: Multi-modal Large Language Models with... | 2024-05-23 | - |
| 138 | LLaVA-HR-X | 35.5 | No | Feast Your Eyes: Mixture-of-Resolution Adaptatio... | 2024-03-05 | Code |
| 139 | SQ-LLaVA | 35.5 | No | SQ-LLaVA: Self-Questioning for Large Vision-Lang... | 2024-03-17 | Code |
| 140 | LOVA$^3$ | 35.2 | No | LOVA3: Learning to Visual Question Answering, As... | 2024-05-23 | Code |
| 141 | LLaVA-InternLM2-7B-ViT + MoSLoRA | 35.2 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 142 | InternLM2+ViT (QMoSLoRA) | 35.2 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 143 | LLaVA1.5-7B-MDA | 35.2 | No | Looking Beyond Text: Reducing Language bias in L... | 2024-11-21 | - |
| 144 | Mipha-3B+ | 35.1 | No | Rethinking Visual Prompting for Multimodal Large... | 2024-07-05 | - |
| 145 | Merlin | 34.9 | No | Merlin: Empowering Multimodal LLMs with Foresight... | 2023-11-30 | - |
| 146 | Arcana | 34.8 | No | Improving Multi-modal Large Language Model throu... | 2024-10-17 | - |
| 147 | INF-LLaVA | 34.5 | No | INF-LLaVA: Dual-perspective Perception for High-... | 2024-07-23 | Code |
| 148 | LLaVA-1.5+MMInstruct (Vicuna-7B) | 34.4 | No | MMInstruct: A High-Quality Multi-Modal Instructi... | 2024-07-22 | Code |
| 149 | Janus | 34.3 | No | Janus: Decoupling Visual Encoding for Unified Mu... | 2024-10-17 | Code |
| 150 | LLaVA-TokenPacker (Vicuna-13B) | 34.1 | No | TokenPacker: Efficient Visual Projector for Mult... | 2024-07-02 | Code |
| 151 | γ-MoD-LLaVA-HR | 34 | No | $γ$-MoD: Exploring Mixture-of-Depth Adaptation f... | 2024-10-17 | - |
| 152 | LLaVA-1.5-7B (CSR) | 33.9 | No | Calibrated Self-Rewarding Vision Language Models | 2024-05-23 | Code |
| 153 | DynMOE-LLaVA | 33.6 | No | Dynamic Mixture of Experts: An Auto-Tuning Appro... | 2024-05-23 | Code |
| 154 | Imp-2B | 33.5 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 155 | InfMLLM-7B-Chat | 33.4 | No | InfMLLM: A Unified Framework for Visual-Language... | 2023-11-12 | Code |
| 156 | Video-LaVIT | 33.2 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 157 | LLaVA-Instruct (Vicuna-1.5-7B) | 32.9 | No | MM-Instruct: Generated Visual Instructions for L... | 2024-06-28 | Code |
| 158 | VisionZip (Retain 128 Tokens, fine-tuning) | 32.9 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 159 | Uni-MoE | 32.8 | No | Uni-MoE: Scaling Unified Multimodal LLMs with Mi... | 2024-05-18 | Code |
| 160 | VL-Mamba (Mamba LLM-2.8B) | 32.6 | No | VL-Mamba: Exploring State Space Models for Multi... | 2024-03-20 | - |
| 161 | LLaVA-v1.5 (7B, w/ STIC) | 32.6 | No | Enhancing Large Vision Language Models with Self... | 2024-05-30 | Code |
| 162 | VisionZip (Retain 192 Tokens, fine-tuning) | 32.6 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 163 | VisionZip (Retain 128 Tokens) | 32.6 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 164 | LLaVA-v1.5 (+MoCa) | 32.2 | No | Deciphering Cross-Modal Alignment in Large Visio... | 2024-10-09 | Code |
| 165 | Dynamic-LLaVA-7B | 32.2 | No | Dynamic-LLaVA: Efficient Multimodal Large Langua... | 2024-12-01 | Code |
| 166 | Mipha-3B | 32.1 | No | Mipha: A Comprehensive Overhaul of Multimodal As... | 2024-03-10 | Code |
| 167 | VOLCANO 7B | 32 | No | Volcano: Mitigating Multimodal Hallucination thr... | 2023-11-13 | Code |
| 168 | Video-LLaVA | 32 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 169 | TinyLLaVA-share-Sig-Ph | 32 | No | TinyLLaVA: A Framework of Small-scale Large Mult... | 2024-02-22 | Code |
| 170 | LLaVA-VT (Vicuna-7B) | 31.8 | No | Beyond Embeddings: The Promise of Visual Table i... | 2024-03-27 | Code |
| 171 | VisionZip (Retain 192 Tokens) | 31.7 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 172 | VisionZip (Retain 64 Tokens) | 31.7 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 173 | LLaVA-1.5-7B (+ SIMA) | 31.6 | No | Enhancing Visual-Language Modality Alignment in ... | 2024-05-24 | Code |
| 174 | MiCo-Chat-7B | 31.4 | No | Explore the Limits of Omni-modal Pretraining at ... | 2024-06-13 | Code |
| 175 | LLaVA-1.5-7B + TeamLoRA | 31.2 | No | TeamLoRA: Boosting Low-Rank Adaptation with Expe... | 2024-08-19 | Code |
| 176 | RoboCodeX-13B | 31 | No | RoboCodeX: Multimodal Code Generation for Roboti... | 2024-02-25 | - |
| 177 | HyperLLaVA | 31 | No | HyperLLaVA: Dynamic Visual and Language Expert T... | 2024-03-20 | Code |
| 178 | FAST (Vicuna-7B) | 31 | No | Visual Agents as Fast and Slow Thinkers | 2024-08-16 | Code |
| 179 | JanusFlow | 30.9 | No | JanusFlow: Harmonizing Autoregression and Rectif... | 2024-11-12 | Code |
| 180 | AlignGPT (Vicuna-7B) | 30.8 | No | AlignGPT: Multi-modal Large Language Models with... | 2024-05-23 | - |
| 181 | LLaVolta | 30.7 | No | Efficient Large Multi-modal Models via Visual Co... | 2024-06-28 | Code |
| 182 | LLaVA-AlignedVQ | 30.7 | No | Aligned Vector Quantization for Edge-Cloud Colla... | 2024-11-08 | - |
| 183 | LLaVA-1.5-HACL | 30.4 | No | Hallucination Augmented Contrastive Learning for... | 2023-12-12 | Code |
| 184 | MaVEn | 30.4 | No | MaVEn: An Effective Multi-granularity Hybrid Vis... | 2024-08-22 | - |
| 185 | VisionZip (Retain 64 Tokens, fine-tuning) | 30.2 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 186 | H2OVL-Mississippi-0.8B | 30 | No | H2OVL-Mississippi Vision Language Models Technic... | 2024-10-17 | - |
| 187 | RoboMamba | 29.7 | No | RoboMamba: Efficient Vision-Language-Action Mode... | 2024-06-06 | - |
| 188 | LLaVA-TokenPacker (Vicuna-7B) | 29.6 | No | TokenPacker: Efficient Visual Projector for Mult... | 2024-07-02 | Code |
| 189 | OneLLM-7B | 29.1 | No | OneLLM: One Framework to Align All Modalities wi... | 2023-12-06 | Code |
| 190 | LLaVA-OneVision-0.5B | 29.1 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 191 | Vary-toy | 29 | No | Small Language Model Meets with Reinforced Visio... | 2024-01-23 | - |
| 192 | LLaVA-Phi | 28.9 | No | LLaVA-Phi: Efficient Multi-Modal Assistant with ... | 2024-01-04 | Code |
| 193 | MMAR-7B | 27.8 | No | MMAR: Towards Lossless Multi-Modal Auto-Regressi... | 2024-10-14 | - |
| 194 | SEAL (7B) | 27.7 | No | V*: Guided Visual Search as a Core Mechanism in ... | 2023-12-21 | Code |
| 195 | OtterHD-8B | 26.3 | No | OtterHD: A High-Resolution Multi-modality Model | 2023-11-07 | Code |
| 196 | TGA-7B | 25.6 | No | Cross-Modal Safety Mechanism Transfer in Large V... | 2024-10-16 | - |
| 197 | LinVT | 23.5 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 198 | Xmodel-VLM (Xmodel-LM 1.1B) | 21.8 | No | Xmodel-VLM: A Simple Baseline for Multimodal Vis... | 2024-05-15 | Code |
| 199 | TextBind | 19.4 | No | TextBind: Multi-turn Interleaved Multimodal Inst... | 2023-09-14 | Code |
| 200 | MMAR-0.5B | 18.49 | No | MMAR: Towards Lossless Multi-Modal Auto-Regressi... | 2024-10-14 | - |
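For readers scripting against an export of this leaderboard, a minimal sketch of how the ranking is derived: entries are sorted by GPT-4 score in descending order (higher is better). The field names below are illustrative, not part of any official export format; the sample scores are copied from the table above.

```python
# Minimal sketch: rank leaderboard entries by GPT-4 score, descending.
# Field names ("model", "gpt4_score", "date") are hypothetical; scores
# are taken from three rows of the MM-Vet table above.
rows = [
    {"model": "Qwen2-VL-7B", "gpt4_score": 62.0, "date": "2024-09-18"},
    {"model": "MMCTAgent (GPT-4 + GPT-4V)", "gpt4_score": 74.24, "date": "2024-05-28"},
    {"model": "InternVL2.5-78B", "gpt4_score": 72.3, "date": "2024-12-06"},
]

# Higher score = better rank, hence reverse=True.
ranked = sorted(rows, key=lambda r: r["gpt4_score"], reverse=True)
for position, row in enumerate(ranked, start=1):
    print(f"{position}. {row['model']}: {row['gpt4_score']}")
```

Ties (e.g. the three models at a score of 50 above) would need a secondary key such as submission date if a deterministic order matters.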