Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in a grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot video question answering benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses existing methods in nine out of ten benchmarks.
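The core operation is the construction of the image grid itself: uniformly sample a handful of frames from a video and tile them into one composite image that an off-the-shelf VLM can consume. The sketch below illustrates this idea; the 2x3 grid, the 336x336 cell size, and the file paths are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the image-grid idea: sample rows*cols frames uniformly
# from a video and tile them into a single composite image for a VLM.
# Grid shape, cell size, and paths are illustrative assumptions.
import cv2
import numpy as np


def make_image_grid(video_path: str, rows: int = 2, cols: int = 3,
                    cell_size: tuple = (336, 336)) -> np.ndarray:
    """Uniformly sample rows*cols frames and arrange them in a grid."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = rows * cols
    # Pick evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), n).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:  # fall back to a blank cell if a frame cannot be read
            frame = np.zeros((cell_size[1], cell_size[0], 3), dtype=np.uint8)
        frames.append(cv2.resize(frame, cell_size))
    cap.release()

    # Stack the cells row by row into one composite image (the "image grid").
    grid_rows = [np.hstack(frames[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(grid_rows)


if __name__ == "__main__":
    grid = make_image_grid("example_video.mp4")  # hypothetical input path
    cv2.imwrite("image_grid.jpg", grid)          # this single image is fed to the VLM
```

Because the output is an ordinary image, no video-specific training or interface is needed; the grid is simply paired with a text prompt and passed to the VLM.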
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | NExT-QA | Accuracy | 70.9 | IG-VLM (LLaVA v1.6) |
| Video Question Answering | NExT-QA | Accuracy | 68.6 | IG-VLM (GPT-4) |
| Video Question Answering | MSVD-QA | Accuracy | 79.6 | IG-VLM-34B |
| Video Question Answering | MSVD-QA | Confidence Score | 4.1 | IG-VLM-34B |
| Video Question Answering | TGIF-QA | Accuracy | 79.1 | IG-VLM |
| Video Question Answering | TGIF-QA | Confidence Score | 4.2 | IG-VLM |
| Video Question Answering | MSRVTT-QA | Accuracy | 63.8 | IG-VLM |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.5 | IG-VLM |
| Video Question Answering | IntentQA | Accuracy | 65.3 | IG-VLM |
| Video Question Answering | TVQA | Accuracy | 57.8 | IG-VLM (no speech, GPT-4V) |
| Video Question Answering | ActivityNet-QA | Accuracy | 58.4 | IG-VLM |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | IG-VLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 3.13 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.61 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.4 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.8 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.89 | IG-VLM (GPT-4V) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 3.17 | IG-VLM (GPT-4V) |