
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

2024-03-27 · Zero-Shot Video Question Answer · Question Answering · Video-based Generative Performance Benchmarking · Video Question Answering · Language Modelling · Multiple-choice

Paper · PDF · Code (official)

Abstract

Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in a grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot video question answering benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses existing methods in nine out of ten benchmarks.
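
To make the core idea concrete, here is a minimal sketch of the image-grid construction described above: sample frames uniformly from a video and tile them into one composite image that any single-image VLM can consume. This is an illustrative reimplementation using OpenCV and Pillow, not the authors' official code (linked above); the 2×3 grid shape, the make_image_grid name, and the uniform-sampling choice are assumptions made for the example.

```python
import cv2              # pip install opencv-python
from PIL import Image   # pip install Pillow

def make_image_grid(video_path: str, rows: int = 2, cols: int = 3) -> Image.Image:
    """Tile rows*cols uniformly sampled frames into one composite image.

    Illustrative sketch of the image-grid idea; the grid shape and the
    uniform sampling are assumptions, not the paper's exact settings.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = rows * cols
    # Uniformly spaced frame indices spanning the whole clip.
    indices = [round(i * (total - 1) / (n - 1)) for i in range(n)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError(f"could not read frame {idx} of {video_path}")
        # OpenCV decodes to BGR; convert to RGB for PIL.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    w, h = frames[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    # Paste frames left-to-right, top-to-bottom, so temporal order is
    # preserved as spatial order inside the single image.
    for i, f in enumerate(frames):
        grid.paste(f, ((i % cols) * w, (i // cols) * h))
    return grid

# The composite image plus the question can then be fed to any
# single-image VLM, with no video-specific training.
make_image_grid("example.mp4").save("image_grid.png")
```

Because the output is an ordinary RGB image, temporal order survives only as the left-to-right, top-to-bottom layout of the grid, which is what allows an off-the-shelf VLM to be applied without any video-data training.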

Results

Task | Dataset | Metric | Value | Model
Video Question Answering | NExT-QA | Accuracy | 70.9 | IG-VLM (LLaVA v1.6)
Video Question Answering | NExT-QA | Accuracy | 68.6 | IG-VLM (GPT-4)
Video Question Answering | MSVD-QA | Accuracy | 79.6 | IG-VLM-34B
Video Question Answering | MSVD-QA | Confidence Score | 4.1 | IG-VLM-34B
Video Question Answering | TGIF-QA | Accuracy | 79.1 | IG-VLM
Video Question Answering | TGIF-QA | Confidence Score | 4.2 | IG-VLM
Video Question Answering | MSRVTT-QA | Accuracy | 63.8 | IG-VLM
Video Question Answering | MSRVTT-QA | Confidence Score | 3.5 | IG-VLM
Video Question Answering | IntentQA | Accuracy | 65.3 | IG-VLM
Video Question Answering | TVQA | Accuracy | 57.8 | IG-VLM (no speech, GPT-4V)
Video Question Answering | ActivityNet-QA | Accuracy | 58.4 | IG-VLM
Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | IG-VLM
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 3.13 | IG-VLM-GPT4v
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.61 | IG-VLM-GPT4v
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.4 | IG-VLM-GPT4v
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.8 | IG-VLM-GPT4v
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.89 | IG-VLM-GPT4v
Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 3.17 | IG-VLM-GPT4v

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)