Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, LiMin Wang

2024-10-25 · Zero-Shot Video Question Answer · Hallucination · Video Question Answering · Highlight Detection · Moment Retrieval · Video Understanding
Paper · PDF · Code

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction tuning task to explicitly incorporate grounding supervision into the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined VideoChat-T, by implementing token shuffling to compress long video tokens and introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of visual representations. Meanwhile, we introduce TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, to perform detailed video description with corresponding timestamp prediction. This explicit temporal location prediction guides the MLLM to correctly attend to the visual content when generating descriptions, and thus reduces the hallucination risk caused by LLMs. Experimental results demonstrate that our TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLMs, achieving improvements of 5.6% and 6.8% on the Egoschema and VideoMME benchmarks, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming existing state-of-the-art MLLMs. After fine-tuning, it performs on par with traditional supervised expert models.
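The abstract mentions token shuffling to compress long video token sequences, but gives no implementation details. The sketch below is an assumption, not the paper's actual method: it merges groups of k adjacent frames' tokens along the channel dimension and projects back to the original width (a random matrix stands in for the learned projection), yielding a k-fold reduction in token count.

```python
import numpy as np

def token_shuffle(tokens: np.ndarray, k: int = 4, seed: int = 0) -> np.ndarray:
    """Hypothetical token-shuffle compression sketch.

    tokens: (T, N, C) — T frames, N visual tokens per frame, C channels.
    Groups k adjacent frames, stacks their tokens channel-wise, and projects
    k*C back down to C with a random matrix (stand-in for a learned layer).
    Returns (T // k, N, C) — k-fold fewer frame slots.
    """
    T, N, C = tokens.shape
    assert T % k == 0, "frame count must be divisible by the group size"
    # (T, N, C) -> (T//k, k, N, C) -> (T//k, N, k, C) -> (T//k, N, k*C)
    grouped = tokens.reshape(T // k, k, N, C).transpose(0, 2, 1, 3).reshape(T // k, N, k * C)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k * C, C)) / np.sqrt(k * C)  # stand-in projection
    return grouped @ W

# 32 frames of 16 tokens each are compressed to 8 frame slots.
video = np.random.default_rng(1).standard_normal((32, 16, 64))
compressed = token_shuffle(video, k=4)
print(compressed.shape)  # (8, 16, 64)
```

The group size k trades temporal resolution for context length: larger k lets more frames fit in the LLM's window at the cost of finer-grained timing.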

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | Video-MME (w/o subs) | Accuracy (%) | 46.3 | VideoChat-T (7B) |
| Question Answering | Video-MME | Accuracy (%) | 55.8 | VideoChat-T (7B) |
| Question Answering | EgoSchema (fullset) | Accuracy | 60 | VideoChat-T (7B) |
| Question Answering | EgoSchema (subset) | Accuracy | 68.4 | VideoChat-T (7B) |
| Video Question Answering | MVBench | Avg. | 59.9 | VideoChat-T (7B) |
| Video Question Answering | Video-MME (w/o subs) | Accuracy (%) | 46.3 | VideoChat-T (7B) |
| Video Question Answering | Video-MME | Accuracy (%) | 55.8 | VideoChat-T (7B) |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 60 | VideoChat-T (7B) |
| Video Question Answering | EgoSchema (subset) | Accuracy | 68.4 | VideoChat-T (7B) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 67.1 | VideoChat-T (FT) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 43 | VideoChat-T (FT) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 48.7 | VideoChat-T (ZS) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 24 | VideoChat-T (ZS) |
| Moment Retrieval | Charades-STA | mIoU | 45.43 | VideoChat-T (ZS) |
| Highlight Detection | QVHighlights | Hit@1 | 55.3 | VideoChat-T (FT) |
| Highlight Detection | QVHighlights | mAP | 27 | VideoChat-T (FT) |

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
UQLM: A Python Package for Uncertainty Quantification in Large Language Models (2025-07-08)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)