Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in a grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot video question answering benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses existing methods in nine out of ten benchmarks.
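The core operation is the construction of the image grid itself: uniformly sample a handful of frames from a video and tile them into one composite image that an off-the-shelf VLM can consume. The sketch below illustrates this idea; the 2x3 grid, the 336x336 cell size, and the file paths are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the image-grid idea: sample rows*cols frames uniformly
# from a video and tile them into a single composite image for a VLM.
# Grid shape, cell size, and paths are illustrative assumptions.
import cv2
import numpy as np


def make_image_grid(video_path: str, rows: int = 2, cols: int = 3,
                    cell_size: tuple = (336, 336)) -> np.ndarray:
    """Uniformly sample rows*cols frames and arrange them in a grid."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = rows * cols
    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), n).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:  # fall back to a blank cell if a frame cannot be read
            frame = np.zeros((cell_size[1], cell_size[0], 3), dtype=np.uint8)
        frames.append(cv2.resize(frame, cell_size))
    cap.release()

    # Stack the cells row by row into one composite image (the "image grid").
    grid_rows = [np.hstack(frames[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(grid_rows)


if __name__ == "__main__":
    grid = make_image_grid("example_video.mp4")  # hypothetical input path
    cv2.imwrite("image_grid.jpg", grid)          # this single image is fed to the VLM
```

Because the output is an ordinary image, no video-specific training or interface is needed; the grid is simply paired with a text prompt and passed to the VLM.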
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | NExT-QA | Accuracy | 70.9 | IG-VLM (LLaVA v1.6) |
| Video Question Answering | NExT-QA | Accuracy | 68.6 | IG-VLM (GPT-4) |
| Video Question Answering | MSVD-QA | Accuracy | 79.6 | IG-VLM-34B |
| Video Question Answering | MSVD-QA | Confidence Score | 4.1 | IG-VLM-34B |
| Video Question Answering | TGIF-QA | Accuracy | 79.1 | IG-VLM |
| Video Question Answering | TGIF-QA | Confidence Score | 4.2 | IG-VLM |
| Video Question Answering | MSRVTT-QA | Accuracy | 63.8 | IG-VLM |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.5 | IG-VLM |
| Video Question Answering | IntentQA | Accuracy | 65.3 | IG-VLM |
| Video Question Answering | TVQA | Accuracy | 57.8 | IG-VLM (no speech, GPT-4V) |
| Video Question Answering | ActivityNet-QA | Accuracy | 58.4 | IG-VLM |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | IG-VLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 3.13 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.61 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.4 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.8 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.89 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 3.17 | IG-VLM (GPT-4V) |