


LITA: Language Instructed Temporal-Localization Assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz

2024-03-27 · Instruction Following · Text Generation · Video-based Generative Performance Benchmarking · Video Question Answering · Temporal Localization
Paper · PDF · Code (official)

Abstract

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and temporal localization of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement of Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA
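
Two of the abstract's quantitative ideas lend themselves to a concrete sketch: the relative time tokens, which discretize a timestamp into one of a fixed number of tokens spanning the video's length, and the temporal IoU behind the mIoU metric reported for ActivityNet-RTL. The Python below is a minimal illustration only; the token count (100) and the exact rounding scheme are assumptions, not necessarily what the NVlabs/LITA code does.

```python
# Illustrative sketch of relative time tokens and temporal IoU.
# NUM_TIME_TOKENS and the rounding scheme are assumptions for
# illustration, not the repo's exact implementation.

NUM_TIME_TOKENS = 100  # assumed vocabulary of discrete time tokens <1>..<100>

def timestamp_to_token(t_sec: float, video_len_sec: float) -> str:
    """Discretize an absolute timestamp into a token relative to video length."""
    frac = min(max(t_sec / video_len_sec, 0.0), 1.0)  # position in [0, 1]
    idx = round(frac * (NUM_TIME_TOKENS - 1))         # nearest token index
    return f"<{idx + 1}>"

def token_to_timestamp(token: str, video_len_sec: float) -> float:
    """Invert the mapping: recover an approximate timestamp in seconds."""
    idx = int(token.strip("<>")) - 1
    return idx / (NUM_TIME_TOKENS - 1) * video_len_sec

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A 60 s video: 24 s maps to token <41>; decoding gives back ~24.2 s.
tok = timestamp_to_token(24.0, 60.0)
print(tok, token_to_timestamp(tok, 60.0))
# IoU of a predicted interval (22 s, 35 s) vs. ground truth (24 s, 36 s): ~0.79
print(temporal_iou((22.0, 35.0), (24.0, 36.0)))
```

Under this representation, a "When?" answer becomes a short pair of tokens (e.g. "<41> <56>") rather than free-form text, and mIoU is simply this temporal IoU averaged over the evaluation questions.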

Results

Task                                            | Dataset       | Metric                     | Value | Model
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency                | 3.19  | LITA-13B
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding   | 3.43  | LITA-13B
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.94  | LITA-13B
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation         | 2.98  | LITA-13B
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding     | 2.68  | LITA-13B
Video-based Generative Performance Benchmarking | VideoInstruct | Mean                       | 3.04  | LITA-13B
Video Question Answering                        | OVBench       | AVG                        | 20.4  | LITA (7B)

The same VideoInstruct results are also indexed under the Visual Question Answering (VQA) and Generative Visual Question Answering tasks.

Related Papers

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
How Many Instructions Can LLMs Follow at Once? (2025-07-15)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)