Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou

2023-12-04 · CVPR 2024

Tasks: Zero-Shot Video Question Answer, Instruction Following, Video-Text Retrieval, Multimodal Large Language Model, Video Question Answering, Highlight Detection, Large Language Model, Temporal Localization, Video Understanding, Dense Captioning, Language Modelling

Links: Paper · PDF · Code (official) · Code

Abstract

This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experiment results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA, compared to state-of-the-art video large language models, holding the potential to serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements.
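The abstract's second contribution — a sliding video Q-Former that emits a token sequence whose length tracks video duration — can be illustrated with a minimal sketch. This is not the paper's implementation: the window size, stride, queries-per-window values, and the `timestamp_prompt` / `sliding_qformer_token_count` helpers below are illustrative assumptions; only the two ideas (binding each frame to its timestamp, and compressing each sliding window to a fixed number of video tokens) come from the abstract.

```python
# Hedged sketch of TimeChat's two ideas (hyperparameters are assumptions,
# not the paper's actual values).

def timestamp_prompt(t_sec: float) -> str:
    """Timestamp-aware encoding, conceptually: each frame's visual feature
    is bound to a textual timestamp like the one returned here."""
    return f"This frame is sampled at {t_sec:.1f} second."

def sliding_qformer_token_count(num_frames: int, window: int = 32,
                                stride: int = 32,
                                queries_per_window: int = 8) -> int:
    """Sliding video Q-Former, conceptually: each window of `window` frames
    is compressed to `queries_per_window` video tokens, so the total token
    sequence grows with video duration instead of being fixed-length."""
    if num_frames <= 0:
        return 0
    if num_frames <= window:
        num_windows = 1
    else:
        # ceil division over the stride for the remaining frames
        num_windows = -(-(num_frames - window) // stride) + 1
    return num_windows * queries_per_window

# A short clip and a longer video get different token budgets:
print(sliding_qformer_token_count(32))   # 1 window  -> 8 tokens
print(sliding_qformer_token_count(96))   # 3 windows -> 24 tokens
print(timestamp_prompt(12.0))
```

The point of the variable-length output is that a 3-minute clip and a 30-minute video are not forced into the same token budget, which is what lets the model scale to long-form video.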

Results

Task | Dataset | Metric | Value | Model
Video Question Answering | Test-of-Time | 2-Class Accuracy | 76.67 | Time-Chat
Video Question Answering | EgoSchema (fullset) | Accuracy | 33 | TimeChat (7B)
Video Question Answering | OVBench | AVG | 12.8 | TimeChat (7B)
Video Question Answering | MVBench | Avg. | 38.5 | TimeChat
Video Retrieval | Test-of-Time | 2-Class Accuracy | 76.67 | Time-Chat

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)