Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens

2024-11-17

Tasks: Zero-Shot Video Question Answer · Video-based Generative Performance Benchmarking (overall, Contextual Understanding, Correctness of Information, Consistency, Temporal Understanding, Detail Orientation) · Video Understanding

Paper · PDF · Code (official)

Abstract

Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs to video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage the powerful pre-trained image LLM. In this work, we explore the limitations of existing compression strategies for building a training-free video LLM. The findings lead to our method, TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video, we select a few equidistant frames from all input frames to construct a Thumbnail image as a detailed visual cue, complemented by Sampled visual tokens from all input frames. Our method establishes new state-of-the-art performance among training-free video LLMs on various benchmarks. Notably, our 34B model outperforms GPT-4V on the MVBench benchmark, and achieves performance comparable to the 72B training-based video LLM, Video-LLaMA2, on the challenging MLVU benchmark. Code is available at https://github.com/tingyu215/TS-LLaVA.
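The two ingredients of the Thumbnail-and-Sampling strategy described above can be sketched in a few lines. This is a hypothetical illustration, not the official TS-LLaVA code: the grid size, keep ratio, and the function names `build_thumbnail` and `sample_tokens` are assumptions for the sake of the example.

```python
import numpy as np

def build_thumbnail(frames, grid=(2, 3)):
    """Tile a few equidistant frames into one thumbnail image.

    frames: list of (H, W, C) arrays; returns an (H*rows, W*cols, C) array.
    """
    rows, cols = grid
    k = rows * cols
    # Equidistant frame indices across the whole clip.
    idx = np.linspace(0, len(frames) - 1, k).astype(int)
    picked = [frames[i] for i in idx]
    row_imgs = [np.concatenate(picked[r * cols:(r + 1) * cols], axis=1)
                for r in range(rows)]
    return np.concatenate(row_imgs, axis=0)

def sample_tokens(frame_tokens, keep_ratio=0.25):
    """Uniformly subsample visual tokens pooled from all frames.

    frame_tokens: (N, D) array of tokens; returns (~N*keep_ratio, D).
    """
    n = frame_tokens.shape[0]
    idx = np.linspace(0, n - 1, max(1, int(n * keep_ratio))).astype(int)
    return frame_tokens[idx]
```

The thumbnail supplies a single detailed image the pre-trained image LLM can read natively, while the uniformly sampled tokens retain coarse temporal coverage of every input frame at a fraction of the token budget.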

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | NExT-QA | Accuracy | 73.6 | TS-LLaVA-34B |
| Video Question Answering | MVBench | Accuracy | 52.6 | TS-LLaVA-34B |
| Video Question Answering | MSVD-QA | Accuracy | 79.4 | TS-LLaVA-34B |
| Video Question Answering | MSVD-QA | Confidence Score | 4.1 | TS-LLaVA-34B |
| Video Question Answering | TGIF-QA | Accuracy | 81.0 | TS-LLaVA-34B |
| Video Question Answering | TGIF-QA | Confidence Score | 4.2 | TS-LLaVA-34B |
| Video Question Answering | MSRVTT-QA | Accuracy | 66.2 | TS-LLaVA-34B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.6 | TS-LLaVA-34B |
| Video Question Answering | IntentQA | Accuracy | 67.9 | TS-LLaVA-34B |
| Video Question Answering | EgoSchema (subset) | Accuracy | 57.8 | TS-LLaVA-34B |
| Video Question Answering | ActivityNet-QA | Accuracy | 58.9 | TS-LLaVA-34B |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 3.38 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.86 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.55 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.03 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.77 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.69 | TS-LLaVA-34B |

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025-07-08)