Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

2023-04-20

Tasks: Spatial Reasoning · Video Question Answering · Visual Reasoning · Large Language Model · Visual Question Answering (VQA) · Language Modelling · Visual Question Answering

Links: Paper · PDF · Code (official)

Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLMs). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can yield numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiments, we found that a model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
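The core idea in the abstract — a single trainable projection layer bridging a frozen visual encoder and a frozen LLM — can be sketched in a few lines. This is a minimal NumPy illustration, not the official implementation; the dimensions (a 768-dim visual feature space, a 4096-dim LLM embedding space, and 32 visual tokens) are assumptions chosen for illustration.

```python
import numpy as np

# Sketch of MiniGPT-4's alignment idea: a single linear projection maps
# features from a frozen visual encoder into the frozen LLM's token-embedding
# space. Only W and b would be trained; everything else stays frozen.

VISUAL_DIM = 768   # assumed visual feature dimension
LLM_DIM = 4096     # assumed LLM hidden/embedding dimension

rng = np.random.default_rng(0)
W = rng.standard_normal((VISUAL_DIM, LLM_DIM)) * 0.02  # trainable weights
b = np.zeros(LLM_DIM)                                  # trainable bias

def project(visual_tokens: np.ndarray) -> np.ndarray:
    """Map [num_tokens, VISUAL_DIM] visual features into LLM embedding space."""
    return visual_tokens @ W + b

# The projected tokens act as "soft prompt" embeddings that the frozen LLM
# consumes alongside ordinary text-token embeddings.
visual_tokens = rng.standard_normal((32, VISUAL_DIM))
llm_inputs = project(visual_tokens)
print(llm_inputs.shape)  # (32, 4096)
```

Keeping both the encoder and the LLM frozen makes the projection layer the only component updated during alignment training, which is why the approach is comparatively cheap to train.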

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | AutoHallusion | Overall Accuracy | 51 | miniGPT4 |
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 13.28 | MiniGPT-v2 |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 5.69 | MiniGPT-v2 |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 11.02 | MiniGPT-v2 |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 10.43 | MiniGPT-v2 |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 34.93 | MiniGPT4-13B |
| Visual Question Answering (VQA) | EmbSpatial-Bench | Generation | 23.54 | MiniGPT4 |
| Video Question Answering | MVBench | Avg. | 18.8 | MiniGPT4 |
| Visual Question Answering | BenchLMM | GPT-3.5 score | 34.93 | MiniGPT4-13B |
| Visual Question Answering | EmbSpatial-Bench | Generation | 23.54 | MiniGPT4 |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)