Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Visual Question Answering on MM-Vet

Metric: GPT-4 score (higher is better)


Results

| # | Model | GPT-4 score | Extra Data | Paper | Date | Code |
|---|-------|-------------|------------|-------|------|------|
| 1 | MMCTAgent (GPT-4 + GPT-4V) | 74.24 | No | MMCTAgent: Multi-modal Critical Thinking Agent F... | 2024-05-28 | - |
| 2 | Qwen2-VL-72B | 74 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 3 | InternVL2.5-78B | 72.3 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 4 | GPT-4o +text rationale +IoT | 72.2 | No | Image-of-Thought Prompting for Visual Reasoning ... | 2024-05-22 | - |
| 5 | Lyra-Pro | 71.4 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 6 | GLM-4V-Plus | 71.1 | No | CogVLM2: Visual Language Models for Image and Vi... | 2024-08-29 | Code |
| 7 | Phantom-7B | 70.8 | No | Phantom of Latent for Large Language and Vision ... | 2024-09-23 | Code |
| 8 | InternVL2.5-38B | 68.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 9 | InternVL2-26B (SGP, token ratio 64%) | 65.6 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 10 | Baichuan-Omni (7B) | 65.4 | No | Baichuan-Omni Technical Report | 2024-10-11 | Code |
| 11 | InternVL2.5-26B | 65 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 12 | Qwen2-VL-7B (finetuned on GAP-VQA train) | 64.954 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 13 | InternVL2-Llama3-76B | 64.4 | No | - | - | - |
| 14 | GLM4 Vision | 63.9 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 15 | LLaVA-OneVision-72B | 63.7 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 16 | Lyra-Base | 63.5 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 17 | InternVL2-26B (SGP, token ratio 35%) | 63.2 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 18 | InternVL 1.5 | 62.8 | No | How Far Are We to GPT-4V? Closing the Gap to Com... | 2024-04-25 | Code |
| 19 | InternVL2.5-8B | 62.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 20 | MAmmoTH-VL-8B | 62.3 | No | MAmmoTH-VL: Eliciting Multimodal Reasoning with ... | 2024-12-06 | Code |
| 21 | Qwen2-VL-7B | 62 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 22 | InternVL2-40B | 61.8 | No | - | - | - |
| 23 | Mini-Gemini-HD-BS | 60.8 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 24 | InternVL2.5-2B | 60.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 25 | MAmmoTH-VL-8B (SI) | 60.6 | No | MAmmoTH-VL: Eliciting Multimodal Reasoning with ... | 2024-12-06 | Code |
| 26 | InternVL2.5-4B | 60.6 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 27 | Mini-Gemini-HD | 59.3 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 28 | GLM-4V-9B | 58 | No | CogVLM2: Visual Language Models for Image and Vi... | 2024-08-29 | Code |
| 29 | LLaVA-OneVision-7B | 57.5 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 30 | LLaVA-NeXT-34B | 57.4 | No | - | - | - |
| 31 | Meteor | 57.3 | No | Meteor: Mamba-based Traversal of Rationale for L... | 2024-05-24 | Code |
| 32 | CROME (Vicuna-13B) | 55.1 | No | CROME: Cross-Modal Adapters for Efficient Multim... | 2024-08-13 | - |
| 33 | IXC2-4KHD | 54.9 | No | InternLM-XComposer2-4KHD: A Pioneering Large Vis... | 2024-04-09 | Code |
| 34 | Weitu-VL-1.0 | 54.7 | No | - | - | - |
| 35 | TroL-7B | 54.7 | No | TroL: Traversal of Layers for Large Language and... | 2024-06-18 | Code |
| 36 | Mini-Gemini | 53 | No | Mini-Gemini: Mining the Potential of Multi-modal... | 2024-03-27 | Code |
| 37 | CogVLM (Vicuna-7B) | 52.8 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 38 | CogAgent | 52.8 | No | CogAgent: A Visual Language Model for GUI Agents | 2023-12-14 | Code |
| 39 | Qwen2-VL-2B (finetuned on GAP-VQA train) | 52.43 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 40 | InternVL2-26B (SGP, token ratio 9%) | 52.1 | No | A Stitch in Time Saves Nine: Small VLM is a Prec... | 2024-12-04 | Code |
| 41 | MM1.5-30B | 52 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 42 | MiniCPM-Llama3-V-2.5-8B (finetuned on GAP-VQA train) | 51.789 | No | Gamified crowd-sourcing of high-quality data for... | 2024-10-05 | - |
| 43 | IXC-2.5-7B | 51.7 | No | InternLM-XComposer-2.5: A Versatile Large Vision... | 2024-07-03 | Code |
| 44 | InternLM-XComposer2 | 51.2 | No | InternLM-XComposer2: Mastering Free-form Text-Im... | 2024-01-29 | Code |
| 45 | Lyra-Mini | 51.2 | No | Lyra: An Efficient and Speech-Centric Framework ... | 2024-12-12 | Code |
| 46 | CuMo-7B | 51 | No | CuMo: Scaling Multimodal LLM with Co-Upcycled Mi... | 2024-05-09 | Code |
| 47 | TACO (Qwen2-7B / SigLIP) | 50.9 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 48 | Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback)) | 50.7 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 49 | POINTS-9B | 50 | No | POINTS: Improving Your Vision-language Model wit... | 2024-09-07 | - |
| 50 | VILA^2-8B | 50 | No | VILA$^2$: VILA Augmented VILA | 2024-07-24 | - |
| 51 | Janus-Pro-7B | 50 | No | Janus-Pro: Unified Multimodal Understanding and ... | 2025-01-29 | Code |
| 52 | Silkie | 49.9 | No | Silkie: Preference Distillation for Large Visual... | 2023-12-17 | - |
| 53 | Silkie (Qwen-VL-Chat + DPO w/ VLFeedback) | 49.9 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 54 | Qwen2-VL-2B | 49.5 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 55 | FlashSloth-HD | 49 | No | FlashSloth: Lightning Multimodal Large Language ... | 2024-12-05 | Code |
| 56 | InternVL 1.2 | 48.9 | No | How Far Are We to GPT-4V? Closing the Gap to Com... | 2024-04-25 | Code |
| 57 | SEA-PRIME (Vicuna-13B) | 48.8 | No | SEA: Supervised Embedding Alignment for Token-Le... | 2024-08-21 | - |
| 58 | InternVL2.5-1B | 48.8 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 59 | MM1-30B-Chat | 48.7 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 60 | SETOKIM (13B) | 48.7 | No | Towards Semantic Equivalence of Tokenization in ... | 2024-06-07 | - |
| 61 | Emu2-Chat | 48.5 | No | Generative Multimodal Models are In-Context Lear... | 2023-12-20 | Code |
| 62 | MG-LLaVA (34B) | 48.5 | No | MG-LLaVA: Towards Multi-Granularity Visual Instr... | 2024-06-25 | Code |
| 63 | SPHINX-Plus | 47.9 | No | SPHINX-X: Scaling Data and Parameters for a Fami... | 2024-02-08 | Code |
| 64 | ConvLLaVA | 45.9 | No | ConvLLaVA: Hierarchical Backbones as Visual Enco... | 2024-05-24 | Code |
| 65 | VILA-13B | 45.7 | No | VILA: On Pre-training for Visual Language Models | 2023-12-12 | Code |
| 66 | TACO (LLaMA3-8B / SigLIP) | 45.7 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 67 | HPT 1.5 Edge | 45.3 | No | - | - | - |
| 68 | TACO (LLaMA3-8B / CLIP) | 45.2 | No | TACO: Learning Multi-modal Action Models with Sy... | 2024-12-07 | Code |
| 69 | LLaVA-v1.6 (7B, w/ STIC) | 45 | No | Enhancing Large Vision Language Models with Self... | 2024-05-30 | Code |
| 70 | H2OVL-Mississippi-2B | 44.7 | No | H2OVL-Mississippi Vision Language Models Technic... | 2024-10-17 | - |
| 71 | PIIP-LLaVA (Vicuna-7B, ConvNeXt-L, CLIP-L) | 44.7 | No | Parameter-Inverted Image Pyramid Networks for Vi... | 2025-01-14 | Code |
| 72 | Imp-4B | 44.6 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 73 | LLaVA-Next-Mistral-7b (+ DPO w/ VLFeedback) | 44.2 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 74 | MGM-7B+RP | 44.1 | No | Img-Diff: Contrastive Data Synthesis for Multimo... | 2024-08-08 | - |
| 75 | LLaVA-Next-Vicuna-7b (+ DPO w/ VLFeedback) | 44.1 | No | VLFeedback: A Large-Scale AI Feedback Dataset fo... | 2024-10-12 | - |
| 76 | VW-LMM | 44 | No | Multi-modal Auto-regressive Modeling via Visual ... | 2024-03-12 | Code |
| 77 | MoAI | 43.7 | No | MoAI: Mixture of All Intelligence for Large Lang... | 2024-03-12 | Code |
| 78 | MM1-3B-Chat | 43.7 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 79 | MM1.5-3B-MoE | 43.7 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 80 | Imp-3B | 43.3 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 81 | ShareGPT4V-13B | 43.1 | No | ShareGPT4V: Improving Large Multi-Modal Models w... | 2023-11-21 | Code |
| 82 | Mini-Gemini (+MoCa) | 42.9 | No | Deciphering Cross-Modal Alignment in Large Visio... | 2024-10-09 | Code |
| 83 | MM1.5-7B | 42.2 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 84 | MM1-7B-Chat | 42.1 | No | MM1: Methods, Analysis & Insights from Multimoda... | 2024-03-14 | - |
| 85 | FlashSloth | 41.9 | No | FlashSloth: Lightning Multimodal Large Language ... | 2024-12-05 | Code |
| 86 | DeepSeek-VL | 41.5 | No | DeepSeek-VL: Towards Real-World Vision-Language ... | 2024-03-08 | Code |
| 87 | LLaVA1.5-13B-BPO | 41.4 | No | Strengthening Multimodal Large Language Model wi... | 2024-03-13 | - |
| 88 | ASMv2 | 41.3 | No | The All-Seeing Project V2: Towards General Relat... | 2024-02-29 | Code |
| 89 | FocusLLaVA | 41.3 | No | FocusLLaVA: A Coarse-to-Fine Approach for Effici... | 2024-11-21 | - |
| 90 | SeVa-13B | 41 | No | Self-Supervised Visual Preference Alignment | 2024-04-16 | Code |
| 91 | MM1.5-3B | 41 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 92 | LLaVA-1.5-7B (VG-S) | 40.4 | No | ProVision: Programmatically Scaling Vision-centr... | 2024-12-09 | Code |
| 93 | CoLLaVO | 40.3 | No | CoLLaVO: Crayon Large Language and Vision mOdel | 2024-02-17 | Code |
| 94 | SPHINX-2k | 40.2 | No | SPHINX: The Joint Mixing of Weights, Tasks, and ... | 2023-11-13 | Code |
| 95 | LLaVA-1.5 (LVIS-Instruct4V) | 40.2 | No | To See is to Believe: Prompting GPT-4V for Bette... | 2023-11-13 | Code |
| 96 | mPLUG-Owl3 | 40.1 | No | mPLUG-Owl3: Towards Long Image-Sequence Understa... | 2024-08-09 | Code |
| 97 | Mono-InternVL-2B | 40.1 | No | Mono-InternVL: Pushing the Boundaries of Monolit... | 2024-10-10 | - |
| 98 | LLaVA1.5-13B-MDA | 39.9 | No | Looking Beyond Text: Reducing Language bias in L... | 2024-11-21 | - |
| 99 | LLaVA-VT (Vicuna-13B) | 39.8 | No | Beyond Embeddings: The Promise of Visual Table i... | 2024-03-27 | Code |
| 100 | MM1.5-1B-MoE | 39.8 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 101 | Janus-Pro-1B | 39.8 | No | Janus-Pro: Unified Multimodal Understanding and ... | 2025-01-29 | Code |
| 102 | SQ-LLaVA∗ | 39.7 | No | SQ-LLaVA: Self-Questioning for Large Vision-Lang... | 2024-03-17 | Code |
| 103 | OmniFusion (grid split + ruDocVQA) | 39.4 | No | OmniFusion Technical Report | 2024-04-09 | - |
| 104 | DeepStack-L-HD (Vicuna-13B) | 39.3 | No | DeepStack: Deeply Stacking Visual Tokens is Surp... | 2024-06-06 | - |
| 105 | LAF-13B | 38.9 | No | From Training-Free to Adaptive: Empirical Insigh... | 2024-01-31 | - |
| 106 | InfiMM-HD | 38.9 | No | InfiMM-HD: A Leap Forward in High-Resolution Mul... | 2024-03-03 | - |
| 107 | InternLM-XC2 + MMDU-45k | 38.8 | No | MMDU: A Multi-Turn Multi-Image Dialog Understand... | 2024-06-17 | Code |
| 108 | LLaVA-1.5-7B (DC-S) | 38.5 | No | ProVision: Programmatically Scaling Vision-centr... | 2024-12-09 | Code |
| 109 | LayoutLMv3+ConvNeXt+CLIP | 38.4 | No | MouSi: Poly-Visual-Expert Vision-Language Models | 2024-01-30 | Code |
| 110 | VOLCANO 13B | 38 | No | Volcano: Mitigating Multimodal Hallucination thr... | 2023-11-13 | Code |
| 111 | LLaVA-1.5+MMInstruct (Vicuna-13B) | 37.9 | No | MMInstruct: A High-Quality Multi-Modal Instructi... | 2024-07-22 | Code |
| 112 | LLaVA-1.5-13B (+CSR) | 37.8 | No | Calibrated Self-Rewarding Vision Language Models | 2024-05-23 | Code |
| 113 | LLaVA-1.5-LLaMA3-8B | 37.8 | No | What If We Recaption Billions of Web Images with... | 2024-06-12 | - |
| 114 | LLaVA-1.5 + DenseFusion-1M (Vicuna-7B) | 37.8 | No | DenseFusion-1M: Merging Vision Experts for Compr... | 2024-07-11 | Code |
| 115 | ShareGPT4V-7B | 37.6 | No | ShareGPT4V: Improving Large Multi-Modal Models w... | 2023-11-21 | Code |
| 116 | LLaVA-1.5+CoS | 37.6 | No | Chain-of-Spot: Interactive Reasoning Improves La... | 2024-03-19 | Code |
| 117 | LLaVA-COCO-13B | 37.5 | No | COCO is "ALL" You Need for Visual Instruction F... | 2024-01-17 | - |
| 118 | LLaVA-S^2 + DenseFusion-1M (Vicuna-7B) | 37.5 | No | DenseFusion-1M: Merging Vision Experts for Compr... | 2024-07-11 | Code |
| 119 | MM1.5-1B | 37.4 | No | MM1.5: Methods, Analysis & Insights from Multimo... | 2024-09-30 | - |
| 120 | Dynamic-LLaVA-13B | 37.3 | No | Dynamic-LLaVA: Efficient Multimodal Large Langua... | 2024-12-01 | Code |
| 121 | SeVa-7B | 37.2 | No | Self-Supervised Visual Preference Alignment | 2024-04-16 | Code |
| 122 | SoM-LLaVA-1.5-T | 37.2 | No | List Items One by One: A New Data Source and Lea... | 2024-04-25 | Code |
| 123 | Emu3 | 37.2 | No | Emu3: Next-Token Prediction is All You Need | 2024-09-27 | Code |
| 124 | LLaVA-Instruct (Vicuna-1.5-13B) | 37.1 | No | MM-Instruct: Generated Visual Instructions for L... | 2024-06-28 | Code |
| 125 | ILLUME | 37 | No | ILLUME: Illuminating Your LLMs to See, Draw, and... | 2024-12-09 | - |
| 126 | LLaVA1.5-7B-BPO | 36.8 | No | Strengthening Multimodal Large Language Model wi... | 2024-03-13 | - |
| 127 | LLaVA-1.5-13B (+ MMFuser) | 36.6 | No | MMFuser: Multimodal Multi-Layer Feature Fuser fo... | 2024-10-15 | Code |
| 128 | CaMML-13B | 36.4 | No | CaMML: Context-Aware Multimodal Learner for Larg... | 2024-01-06 | Code |
| 129 | LLaVA-65B (Data Mixing) | 36.4 | No | An Empirical Study of Scaling Instruct-Tuned Lar... | 2023-09-18 | Code |
| 130 | Vary-base | 36.2 | No | Vary: Scaling up the Vision Vocabulary for Large... | 2023-12-11 | Code |
| 131 | StableLLaVA | 36.1 | No | StableLLaVA: Enhanced Visual Instruction Tuning ... | 2023-08-20 | Code |
| 132 | DreamLLM-7B | 35.9 | No | DreamLLM: Synergistic Multimodal Comprehension a... | 2023-09-20 | Code |
| 133 | MoE-LLaVA-2.7B×4-Top2 | 35.9 | No | MoE-LLaVA: Mixture of Experts for Large Vision-L... | 2024-01-29 | Code |
| 134 | SoM-LLaVA-1.5 | 35.9 | No | List Items One by One: A New Data Source and Lea... | 2024-04-25 | Code |
| 135 | Dragonfly (Llama3-8B) | 35.9 | No | Dragonfly: Multi-Resolution Zoom-In Encoding Enh... | 2024-06-03 | Code |
| 136 | Ferret-v2-13B | 35.7 | No | Ferret-v2: An Improved Baseline for Referring an... | 2024-04-11 | Code |
| 137 | AlignGPT (Vicuna-13B) | 35.6 | No | AlignGPT: Multi-modal Large Language Models with... | 2024-05-23 | - |
| 138 | LLaVA-HR-X | 35.5 | No | Feast Your Eyes: Mixture-of-Resolution Adaptatio... | 2024-03-05 | Code |
| 139 | SQ-LLaVA | 35.5 | No | SQ-LLaVA: Self-Questioning for Large Vision-Lang... | 2024-03-17 | Code |
| 140 | LOVA$^3$ | 35.2 | No | LOVA3: Learning to Visual Question Answering, As... | 2024-05-23 | Code |
| 141 | LLaVA-InternLM2-7B-ViT + MoSLoRA | 35.2 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 142 | InternLM2+ViT (QMoSLoRA) | 35.2 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 143 | LLaVA1.5-7B-MDA | 35.2 | No | Looking Beyond Text: Reducing Language bias in L... | 2024-11-21 | - |
| 144 | Mipha-3B+ | 35.1 | No | Rethinking Visual Prompting for Multimodal Large... | 2024-07-05 | - |
| 145 | Merlin | 34.9 | No | Merlin: Empowering Multimodal LLMs with Foresight... | 2023-11-30 | - |
| 146 | Arcana | 34.8 | No | Improving Multi-modal Large Language Model throu... | 2024-10-17 | - |
| 147 | INF-LLaVA | 34.5 | No | INF-LLaVA: Dual-perspective Perception for High-... | 2024-07-23 | Code |
| 148 | LLaVA-1.5+MMInstruct (Vicuna-7B) | 34.4 | No | MMInstruct: A High-Quality Multi-Modal Instructi... | 2024-07-22 | Code |
| 149 | Janus | 34.3 | No | Janus: Decoupling Visual Encoding for Unified Mu... | 2024-10-17 | Code |
| 150 | LLaVA-TokenPacker (Vicuna-13B) | 34.1 | No | TokenPacker: Efficient Visual Projector for Mult... | 2024-07-02 | Code |
| 151 | γ-MoD-LLaVA-HR | 34 | No | $γ$-MoD: Exploring Mixture-of-Depth Adaptation f... | 2024-10-17 | - |
| 152 | LLaVA-1.5-7B (CSR) | 33.9 | No | Calibrated Self-Rewarding Vision Language Models | 2024-05-23 | Code |
| 153 | DynMOE-LLaVA | 33.6 | No | Dynamic Mixture of Experts: An Auto-Tuning Appro... | 2024-05-23 | Code |
| 154 | Imp-2B | 33.5 | No | Imp: Highly Capable Large Multimodal Models for ... | 2024-05-20 | Code |
| 155 | InfMLLM-7B-Chat | 33.4 | No | InfMLLM: A Unified Framework for Visual-Language... | 2023-11-12 | Code |
| 156 | Video-LaVIT | 33.2 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 157 | LLaVA-Instruct (Vicuna-1.5-7B) | 32.9 | No | MM-Instruct: Generated Visual Instructions for L... | 2024-06-28 | Code |
| 158 | VisionZip (Retain 128 Tokens, fine-tuning) | 32.9 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 159 | Uni-MoE | 32.8 | No | Uni-MoE: Scaling Unified Multimodal LLMs with Mi... | 2024-05-18 | Code |
| 160 | VL-Mamba (Mamba LLM-2.8B) | 32.6 | No | VL-Mamba: Exploring State Space Models for Multi... | 2024-03-20 | - |
| 161 | LLaVA-v1.5 (7B, w/ STIC) | 32.6 | No | Enhancing Large Vision Language Models with Self... | 2024-05-30 | Code |
| 162 | VisionZip (Retain 192 Tokens, fine-tuning) | 32.6 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 163 | VisionZip (Retain 128 Tokens) | 32.6 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 164 | LLaVA-v1.5 (+MoCa) | 32.2 | No | Deciphering Cross-Modal Alignment in Large Visio... | 2024-10-09 | Code |
| 165 | Dynamic-LLaVA-7B | 32.2 | No | Dynamic-LLaVA: Efficient Multimodal Large Langua... | 2024-12-01 | Code |
| 166 | Mipha-3B | 32.1 | No | Mipha: A Comprehensive Overhaul of Multimodal As... | 2024-03-10 | Code |
| 167 | VOLCANO 7B | 32 | No | Volcano: Mitigating Multimodal Hallucination thr... | 2023-11-13 | Code |
| 168 | Video-LLaVA | 32 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 169 | TinyLLaVA-share-Sig-Ph | 32 | No | TinyLLaVA: A Framework of Small-scale Large Mult... | 2024-02-22 | Code |
| 170 | LLaVA-VT (Vicuna-7B) | 31.8 | No | Beyond Embeddings: The Promise of Visual Table i... | 2024-03-27 | Code |
| 171 | VisionZip (Retain 192 Tokens) | 31.7 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 172 | VisionZip (Retain 64 Tokens) | 31.7 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 173 | LLaVA-1.5-7B (+ SIMA) | 31.6 | No | Enhancing Visual-Language Modality Alignment in ... | 2024-05-24 | Code |
| 174 | MiCo-Chat-7B | 31.4 | No | Explore the Limits of Omni-modal Pretraining at ... | 2024-06-13 | Code |
| 175 | LLaVA-1.5-7B + TeamLoRA | 31.2 | No | TeamLoRA: Boosting Low-Rank Adaptation with Expe... | 2024-08-19 | Code |
| 176 | RoboCodeX-13B | 31 | No | RoboCodeX: Multimodal Code Generation for Roboti... | 2024-02-25 | - |
| 177 | HyperLLaVA | 31 | No | HyperLLaVA: Dynamic Visual and Language Expert T... | 2024-03-20 | Code |
| 178 | FAST (Vicuna-7B) | 31 | No | Visual Agents as Fast and Slow Thinkers | 2024-08-16 | Code |
| 179 | JanusFlow | 30.9 | No | JanusFlow: Harmonizing Autoregression and Rectif... | 2024-11-12 | Code |
| 180 | AlignGPT (Vicuna-7B) | 30.8 | No | AlignGPT: Multi-modal Large Language Models with... | 2024-05-23 | - |
| 181 | LLaVolta | 30.7 | No | Efficient Large Multi-modal Models via Visual Co... | 2024-06-28 | Code |
| 182 | LLaVA-AlignedVQ | 30.7 | No | Aligned Vector Quantization for Edge-Cloud Colla... | 2024-11-08 | - |
| 183 | LLaVA-1.5-HACL | 30.4 | No | Hallucination Augmented Contrastive Learning for... | 2023-12-12 | Code |
| 184 | MaVEn | 30.4 | No | MaVEn: An Effective Multi-granularity Hybrid Vis... | 2024-08-22 | - |
| 185 | VisionZip (Retain 64 Tokens, fine-tuning) | 30.2 | No | VisionZip: Longer is Better but Not Necessary in... | 2024-12-05 | Code |
| 186 | H2OVL-Mississippi-0.8B | 30 | No | H2OVL-Mississippi Vision Language Models Technic... | 2024-10-17 | - |
| 187 | RoboMamba | 29.7 | No | RoboMamba: Efficient Vision-Language-Action Mode... | 2024-06-06 | - |
| 188 | LLaVA-TokenPacker (Vicuna-7B) | 29.6 | No | TokenPacker: Efficient Visual Projector for Mult... | 2024-07-02 | Code |
| 189 | OneLLM-7B | 29.1 | No | OneLLM: One Framework to Align All Modalities wi... | 2023-12-06 | Code |
| 190 | LLaVA-OneVision-0.5B | 29.1 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 191 | Vary-toy | 29 | No | Small Language Model Meets with Reinforced Visio... | 2024-01-23 | - |
| 192 | LLaVA-Phi | 28.9 | No | LLaVA-Phi: Efficient Multi-Modal Assistant with ... | 2024-01-04 | Code |
| 193 | MMAR-7B | 27.8 | No | MMAR: Towards Lossless Multi-Modal Auto-Regressi... | 2024-10-14 | - |
| 194 | SEAL (7B) | 27.7 | No | V*: Guided Visual Search as a Core Mechanism in ... | 2023-12-21 | Code |
| 195 | OtterHD-8B | 26.3 | No | OtterHD: A High-Resolution Multi-modality Model | 2023-11-07 | Code |
| 196 | TGA-7B | 25.6 | No | Cross-Modal Safety Mechanism Transfer in Large V... | 2024-10-16 | - |
| 197 | LinVT | 23.5 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 198 | Xmodel-VLM (Xmodel-LM 1.1B) | 21.8 | No | Xmodel-VLM: A Simple Baseline for Multimodal Vis... | 2024-05-15 | Code |
| 199 | TextBind | 19.4 | No | TextBind: Multi-turn Interleaved Multimodal Inst... | 2023-09-14 | Code |
| 200 | MMAR-0.5B | 18.49 | No | MMAR: Towards Lossless Multi-Modal Auto-Regressi... | 2024-10-14 | - |
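For readers scripting against an export of this leaderboard, a minimal sketch of how the ranking is derived: entries are sorted by GPT-4 score in descending order (higher is better). The field names below are illustrative, not part of any official export format; the sample scores are copied from the table above.

```python
# Minimal sketch: rank leaderboard entries by GPT-4 score, descending.
# Field names ("model", "gpt4_score", "date") are hypothetical; scores
# are taken from three rows of the MM-Vet table above.
rows = [
    {"model": "Qwen2-VL-7B", "gpt4_score": 62.0, "date": "2024-09-18"},
    {"model": "MMCTAgent (GPT-4 + GPT-4V)", "gpt4_score": 74.24, "date": "2024-05-28"},
    {"model": "InternVL2.5-78B", "gpt4_score": 72.3, "date": "2024-12-06"},
]

# Higher score = better rank, hence reverse=True.
ranked = sorted(rows, key=lambda r: r["gpt4_score"], reverse=True)
for position, row in enumerate(ranked, start=1):
    print(f"{position}. {row['model']}: {row['gpt4_score']}")
```

Ties (e.g. the three models at a score of 50 above) would need a secondary key such as submission date if a deterministic order matters.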