Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ST-LLM: Large Language Models Are Effective Temporal Learners

Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

2024-03-30 · Reading Comprehension · Video-based Generative Performance Benchmarking · Video Question Answering · Video Understanding

Paper · PDF · Code (official)

Abstract

Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLM? Surprisingly, this simple approach yields significant improvements in video understanding. Building on this, we propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside the LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we have also designed a global-local input module to balance efficiency and effectiveness. Consequently, we harness the LLM for proficient spatial-temporal modeling while upholding efficiency and stability. Extensive experimental results attest to the effectiveness of our method. Through a more concise model and training pipeline, ST-LLM establishes a new state-of-the-art result on VideoChatGPT-Bench and MVBench. Code is available at https://github.com/TencentARC/ST-LLM.
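The two core ideas in the abstract — flattening all spatial-temporal tokens into one sequence for the LLM, and dynamically masking a subset of them to cut cost — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: all function names, shapes, and the uniform-random masking policy are assumptions made for the example (the paper's actual masking strategy and training objectives are more involved).

```python
import numpy as np

def flatten_st_tokens(video_feats):
    """Flatten per-frame spatial tokens into one spatial-temporal sequence.

    video_feats: shape (T, N, D) -- T frames, N spatial tokens per frame,
    D channels. Returns shape (T*N, D), i.e. every visual token in temporal
    order, ready to be prepended to the LLM's text tokens.
    """
    T, N, D = video_feats.shape
    return video_feats.reshape(T * N, D)

def dynamic_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens (order preserved).

    A rough stand-in for the paper's dynamic masking, which reduces the
    overhead of feeding uncompressed video tokens to the LLM.
    """
    keep = int(round(tokens.shape[0] * (1.0 - mask_ratio)))
    idx = np.sort(rng.choice(tokens.shape[0], size=keep, replace=False))
    return tokens[idx]

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 32))             # 8 frames x 16 tokens x 32 dims
seq = flatten_st_tokens(feats)                       # (128, 32)
masked = dynamic_mask(seq, mask_ratio=0.5, rng=rng)  # (64, 32)
print(seq.shape, masked.shape)                       # prints (128, 32) (64, 32)
```

With half the tokens masked, the sequence handed to the LLM is half as long, which is where the efficiency gain comes from; the tailor-made training objectives described in the paper are what keep accuracy stable under this masking.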

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | MSVD-QA | Accuracy | 74.6 | ST-LLM |
| Video Question Answering | MSVD-QA | Confidence Score | 3.9 | ST-LLM |
| Video Question Answering | MSRVTT-QA | Accuracy | 63.2 | ST-LLM |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.4 | ST-LLM |
| Video Question Answering | ActivityNet-QA | Accuracy | 50.9 | ST-LLM |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | ST-LLM |
| Video Question Answering | TVBench | Average Accuracy | 35.7 | ST-LLM |
| Video Question Answering | MVBench | Avg. | 54.9 | ST-LLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.23 | ST-LLM-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 3.05 | ST-LLM-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.74 | ST-LLM-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.93 | ST-LLM-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.81 | ST-LLM-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 3.15 | ST-LLM-7B |

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
- Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
- Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
- MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
- Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025-07-08)