Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although recent VLMs support basic visual dialog and reasoning, a performance gap remains compared with advanced models such as GPT-4 and Gemini. We narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose utilizing an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. Overall, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B parameters. It achieves leading performance on several zero-shot benchmarks and even surpasses well-developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
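The high-resolution refinement described above keeps the LLM's visual token budget fixed by letting each low-resolution visual token gather detail from a denser high-resolution feature map. Below is a minimal NumPy sketch of this cross-attention idea; the function name, shapes, and single-head formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def highres_refine(lowres_tokens, highres_feats):
    """Hypothetical single-head cross-attention refinement.

    Each low-resolution visual token (query) attends over high-resolution
    features (keys/values). The output has the same number of tokens as
    the low-resolution input, so the token count fed to the LLM is unchanged.
    """
    d = lowres_tokens.shape[-1]
    attn = softmax(lowres_tokens @ highres_feats.T / np.sqrt(d))
    # residual connection: refine rather than replace the coarse tokens
    return lowres_tokens + attn @ highres_feats

# toy shapes: 576 low-res tokens, 2304 high-res patches, feature dim 64
rng = np.random.default_rng(0)
low = rng.standard_normal((576, 64))
high = rng.standard_normal((2304, 64))
out = highres_refine(low, high)
print(out.shape)  # (576, 64) -- token count unchanged
```

The key design point is that resolution is raised only on the key/value side; the query side (and therefore the sequence length seen by the LLM) stays fixed.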
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 60.8 | Mini-Gemini-HD-BS |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 59.3 | Mini-Gemini-HD |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 53 | Mini-Gemini |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.24 | MGM-2B (w/o LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 92.97 | MGM-2B (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.99 | MGM-2B (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.69 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 74.30 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 69.81 | MGM-2B (w/o LoRA, w/o extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 98.75 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 98.17 | MGM-2B (w/o LoRA, w/o extra data) |