Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia

2024-03-27 · Visual Dialog · Image Classification · Referring Expression Generation · Image Comprehension · Referring Expression Comprehension · Visual Question Answering

Paper · PDF · Code · Code (official)

Abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
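The core refinement idea from the abstract — an extra high-resolution encoder whose features are mined into the existing visual tokens without growing their count — can be sketched as single-head cross-attention, where the low-resolution tokens act as queries over the dense high-resolution features. This is an illustrative simplification, not the paper's implementation; the function name, shapes, and residual connection are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_info_mining(low_res_tokens, high_res_feats):
    """Cross-attention: low-res tokens (queries) mine detail from
    high-res features (keys/values). Output keeps the low-res token count."""
    d = low_res_tokens.shape[-1]
    attn = softmax(low_res_tokens @ high_res_feats.T / np.sqrt(d))
    return low_res_tokens + attn @ high_res_feats  # residual refinement

rng = np.random.default_rng(0)
N, M, d = 16, 64, 8  # 16 low-res visual tokens, 64 high-res patches (toy sizes)
refined = patch_info_mining(rng.normal(size=(N, d)), rng.normal(size=(M, d)))
print(refined.shape)  # (16, 8) — token count fed to the LLM is unchanged
```

The point of the sketch is the shape invariant: however many high-resolution patches are mined, the number of visual tokens passed on to the language model stays fixed.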

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 60.8 | Mini-Gemini-HD-BS |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 59.3 | Mini-Gemini-HD |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 53.0 | Mini-Gemini |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.24 | MGM-2B (w/o LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 92.97 | MGM-2B (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.99 | MGM-2B (w/o LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.69 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 74.30 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 69.81 | MGM-2B (w/o LoRA, w/o extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 98.75 | MGM-2B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 98.17 | MGM-2B (w/o LoRA, w/o extra data) |

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
- Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks (2025-07-14)