Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens

2024-11-17

Tasks: Zero-Shot Video Question Answer · Video-based Generative Performance Benchmarking (overall, Contextual Understanding, Correctness of Information, Consistency, Temporal Understanding, Detail Orientation) · Video Understanding

Paper · PDF · Code (official)

Abstract

Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs to video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage the powerful pre-trained image LLM. In this work, we explore the limitations of existing compression strategies for building a training-free video LLM. The findings lead to our method, TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video, we select a few equidistant frames from all input frames to construct a Thumbnail image as a detailed visual cue, complemented by Sampled visual tokens from all input frames. Our method establishes new state-of-the-art performance among training-free video LLMs on various benchmarks. Notably, our 34B model outperforms GPT-4V on the MVBench benchmark, and achieves performance comparable to the 72B training-based video LLM, Video-LLaMA2, on the challenging MLVU benchmark. Code is available at https://github.com/tingyu215/TS-LLaVA.
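The two ingredients of the Thumbnail-and-Sampling strategy described above can be sketched in a few lines. This is a hypothetical illustration, not the official TS-LLaVA code: the grid size, keep ratio, and the function names `build_thumbnail` and `sample_tokens` are assumptions for the sake of the example.

```python
import numpy as np

def build_thumbnail(frames, grid=(2, 3)):
    """Tile a few equidistant frames into one thumbnail image.

    frames: list of (H, W, C) arrays; returns an (H*rows, W*cols, C) array.
    """
    rows, cols = grid
    k = rows * cols
    # Equidistant frame indices across the whole clip.
    idx = np.linspace(0, len(frames) - 1, k).astype(int)
    picked = [frames[i] for i in idx]
    row_imgs = [np.concatenate(picked[r * cols:(r + 1) * cols], axis=1)
                for r in range(rows)]
    return np.concatenate(row_imgs, axis=0)

def sample_tokens(frame_tokens, keep_ratio=0.25):
    """Uniformly subsample visual tokens pooled from all frames.

    frame_tokens: (N, D) array of tokens; returns (~N*keep_ratio, D).
    """
    n = frame_tokens.shape[0]
    idx = np.linspace(0, n - 1, max(1, int(n * keep_ratio))).astype(int)
    return frame_tokens[idx]
```

The thumbnail supplies a single detailed image the pre-trained image LLM can read natively, while the uniformly sampled tokens retain coarse temporal coverage of every input frame at a fraction of the token budget.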

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | NExT-QA | Accuracy | 73.6 | TS-LLaVA-34B |
| Video Question Answering | MVBench | Accuracy | 52.6 | TS-LLaVA-34B |
| Video Question Answering | MSVD-QA | Accuracy | 79.4 | TS-LLaVA-34B |
| Video Question Answering | MSVD-QA | Confidence Score | 4.1 | TS-LLaVA-34B |
| Video Question Answering | TGIF-QA | Accuracy | 81.0 | TS-LLaVA-34B |
| Video Question Answering | TGIF-QA | Confidence Score | 4.2 | TS-LLaVA-34B |
| Video Question Answering | MSRVTT-QA | Accuracy | 66.2 | TS-LLaVA-34B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.6 | TS-LLaVA-34B |
| Video Question Answering | IntentQA | Accuracy | 67.9 | TS-LLaVA-34B |
| Video Question Answering | EgoSchema (subset) | Accuracy | 57.8 | TS-LLaVA-34B |
| Video Question Answering | ActivityNet-QA | Accuracy | 58.9 | TS-LLaVA-34B |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 3.38 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.86 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.55 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.03 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.77 | TS-LLaVA-34B |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.69 | TS-LLaVA-34B |

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025-07-08)