Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although recent VLMs support basic visual dialog and reasoning, a performance gap remains compared with advanced models such as GPT-4 and Gemini. We narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose utilizing an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. Overall, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B parameters. It achieves leading performance on several zero-shot benchmarks and even surpasses well-developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
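The high-resolution refinement described above keeps the LLM's visual token budget fixed by letting each low-resolution visual token gather detail from a denser high-resolution feature map. Below is a minimal NumPy sketch of this cross-attention idea; the function name, shapes, and single-head formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def highres_refine(lowres_tokens, highres_feats):
    """Hypothetical single-head cross-attention refinement.

    Each low-resolution visual token (query) attends over high-resolution
    features (keys/values). The output has the same number of tokens as
    the low-resolution input, so the token count fed to the LLM is unchanged.
    """
    d = lowres_tokens.shape[-1]
    attn = softmax(lowres_tokens @ highres_feats.T / np.sqrt(d))
    # residual connection: refine rather than replace the coarse tokens
    return lowres_tokens + attn @ highres_feats

# toy shapes: 576 low-res tokens, 2304 high-res patches, feature dim 64
rng = np.random.default_rng(0)
low = rng.standard_normal((576, 64))
high = rng.standard_normal((2304, 64))
out = highres_refine(low, high)
print(out.shape)  # (576, 64) -- token count unchanged
```

The key design point is that resolution is raised only on the key/value side; the query side (and therefore the sequence length seen by the LLM) stays fixed.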
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 60.8 | Mini-Gemini-HD-BS |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 59.3 | Mini-Gemini-HD |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 53 | Mini-Gemini |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.24 | MGM-2B (w/o LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 92.97 | MGM-2B (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.99 | MGM-2B (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.69 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 74.30 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 69.81 | MGM-2B (w/o LoRA, w/o extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 98.75 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 98.17 | MGM-2B (w/o LoRA, w/o extra data) |