De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction-following capabilities. However, an important missing piece is temporal localization: these models cannot accurately answer "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing the Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the ActivityNet-RTL dataset for learning and evaluating this task. Reasoning temporal localization requires both reasoning and temporal localization from Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA
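The relative time representation in (1) is concrete enough to sketch. Below is a minimal, hypothetical encoding in which timestamps are binned into a fixed number of chunks of the video's length, so one shared token vocabulary covers videos of any duration; the `<t_i>` token names, the `NUM_TIME_TOKENS` granularity, and both helper functions are illustrative assumptions, not LITA's actual implementation.

```python
# A minimal sketch of relative time tokens: the video is split into
# NUM_TIME_TOKENS equal chunks and each chunk gets a discrete token
# <t_0> ... <t_{T-1}>. Names and granularity are illustrative assumptions,
# not LITA's exact vocabulary.

NUM_TIME_TOKENS = 100  # illustrative granularity

def timestamp_to_token(t_sec: float, video_len_sec: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> str:
    """Map an absolute timestamp to a time token relative to video length."""
    frac = min(max(t_sec / video_len_sec, 0.0), 1.0)   # clamp to [0, 1]
    idx = min(int(frac * num_tokens), num_tokens - 1)  # chunk index
    return f"<t_{idx}>"

def token_to_timestamp(token: str, video_len_sec: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> float:
    """Decode a time token back to the center of its chunk, in seconds."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / num_tokens * video_len_sec

# A 90 s timestamp in a 300 s video maps to <t_30>;
# decoding recovers the chunk center at 91.5 s.
print(timestamp_to_token(90.0, 300.0))      # -> <t_30>
print(token_to_timestamp("<t_30>", 300.0))  # -> 91.5
```

The appeal of a relative encoding is that the model never reasons about absolute seconds: `<t_30>` means roughly 30% into the video, whether the video lasts one minute or one hour.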
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 3.19 | LITA-13B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.43 | LITA-13B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.94 | LITA-13B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.98 | LITA-13B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.68 | LITA-13B |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 3.04 | LITA-13B |
| Video Question Answering | OVBench | AVG | 20.4 | LITA (7B) |
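For reference, the temporal mean intersection-over-union (mIoU) reported for RTL in the abstract can be sketched as follows. Segments are (start, end) pairs in seconds; the helper names and the one-to-one pairing of each prediction with its ground-truth segment are simplifying assumptions for illustration, not the benchmark's official evaluation code.

```python
# A minimal sketch of temporal mean IoU (mIoU), assuming each predicted
# segment is paired one-to-one with a ground-truth segment.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_miou(preds, gts):
    """Mean IoU over paired (prediction, ground-truth) segments."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Predicting [10, 25] s for a ground-truth event at [12, 30] s gives
# IoU = 13 / 20 = 0.65; averaging such scores over a dataset yields mIoU.
print(temporal_iou((10.0, 25.0), (12.0, 30.0)))  # -> 0.65
```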