Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny

2024-04-04

Tasks: Zero-Shot Video Question Answer; Zeroshot Video Question Answer; Video-based Generative Performance Benchmarking (Contextual Understanding); Multimodal Large Language Model; Video-based Generative Performance Benchmarking (Correctness of Information); Video Question Answering; Video-based Generative Performance Benchmarking (Consistency); Video-based Generative Performance Benchmarking (Temporal Understanding); Large Language Model; Video-based Generative Performance Benchmarking (Detail Orientation); Video Understanding; Language Modelling; Multiple-choice

Abstract

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks, this paper extends the model's capabilities to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video considers not only visual content but also textual conversations, allowing the model to effectively answer queries involving both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively. Our models and code have been made publicly available at https://vision-cair.github.io/MiniGPT4-video/
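
The "interleaved visual-textual tokens" of the title can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `interleave_frames_and_text`, the placeholder token strings, and the choice to append the user instruction at the end are all hypothetical, and a real model would operate on embeddings rather than strings.

```python
from typing import List, Sequence


def interleave_frames_and_text(
    frame_tokens: Sequence[Sequence[str]],
    frame_texts: Sequence[Sequence[str]],
    instruction: Sequence[str],
) -> List[str]:
    """Build one sequence: frame 0 tokens, frame 0 text, frame 1 tokens, ...

    Hypothetical helper for illustration only. Each frame contributes its
    visual tokens followed by the text tokens aligned with that frame
    (which may be empty); the instruction is appended last.
    """
    if len(frame_tokens) != len(frame_texts):
        raise ValueError("need one text segment (possibly empty) per frame")

    sequence: List[str] = []
    for visual, text in zip(frame_tokens, frame_texts):
        sequence.extend(visual)  # visual tokens for this frame
        sequence.extend(text)    # text tokens aligned with this frame
    sequence.extend(instruction)  # the question comes after all frames
    return sequence


# Toy input: two frames with two placeholder visual tokens each.
frames = [["<img_0_a>", "<img_0_b>"], ["<img_1_a>", "<img_1_b>"]]
texts = [["a", "dog"], []]  # the second frame has no aligned text
prompt = ["What", "happens", "?"]

result = interleave_frames_and_text(frames, texts, prompt)
print(result)
```

Keeping each frame's text next to its visual tokens preserves temporal order in a single flat sequence, which is the property the title's wording suggests.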

Results

Task | Dataset | Metric | Value | Model
Question Answering | MSVD-QA | Accuracy | 73.92 | MiniGPT4-video-7B
Question Answering | TGIF-QA | Accuracy | 72.22 | MiniGPT4-video-7B
Question Answering | MSRVTT-QA | Accuracy | 59.73 | MiniGPT4-video-7B
Question Answering | TVQA | Accuracy | 54.21 | MiniGPT4-video-7B
Question Answering | ActivityNet-QA | Accuracy | 46.3 | MiniGPT4-video-7B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.57 | MiniGPT4-video-7B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.08 | MiniGPT4-video-7B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.02 | MiniGPT4-video-7B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.65 | MiniGPT4-video-7B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.67 | MiniGPT4-video-7B
Video Question Answering | MSVD-QA | Accuracy | 73.92 | MiniGPT4-video-7B
Video Question Answering | TGIF-QA | Accuracy | 72.22 | MiniGPT4-video-7B
Video Question Answering | MSRVTT-QA | Accuracy | 59.73 | MiniGPT4-video-7B
Video Question Answering | TVQA | Accuracy | 54.21 | MiniGPT4-video-7B
Video Question Answering | ActivityNet-QA | Accuracy | 46.3 | MiniGPT4-video-7B
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.57 | MiniGPT4-video-7B
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.08 | MiniGPT4-video-7B
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.02 | MiniGPT4-video-7B
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.65 | MiniGPT4-video-7B
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.67 | MiniGPT4-video-7B
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.08 | MiniGPT4-video-7B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.57 | MiniGPT4-video-7B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.08 | MiniGPT4-video-7B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.02 | MiniGPT4-video-7B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.65 | MiniGPT4-video-7B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.67 | MiniGPT4-video-7B

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)