Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, Liqiang Nie

2023-12-01 · Video Retrieval · Video Question Answering · Video Captioning

Paper · PDF · Code · Code (official)

Abstract

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods. Code is available at https://github.com/SCZwangxiao/RTQ-MM2023.
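
To make the three stages concrete, here is a minimal PyTorch sketch of the pipeline the abstract names: Refine (drop redundant tokens within frames), Temporal model (relate tokens across frames), and Query (pool task-specific information). All module names, the norm-based token scoring, and the hyperparameters below are hypothetical stand-ins chosen for illustration, not the authors' implementation; the actual code is in the linked repository.

```python
import torch
import torch.nn as nn

class RTQSketch(nn.Module):
    """Illustrative Refine -> Temporal model -> Query pipeline.

    Hypothetical structure only; see https://github.com/SCZwangxiao/RTQ-MM2023
    for the authors' actual implementation.
    """

    def __init__(self, dim=512, num_queries=32, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio  # fraction of frame tokens kept (Refine)
        # Temporal model: a small transformer over tokens from all frames.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Query: learnable task-specific queries that cross-attend to the video.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, tokens_per_frame, dim),
        # e.g. patch features from a frozen image-text encoder.
        b, f, t, d = frame_tokens.shape

        # 1) Refine: keep the top-k tokens per frame. The L2-norm saliency
        #    score here is a placeholder, not the paper's criterion.
        k = max(1, int(t * self.keep_ratio))
        scores = frame_tokens.norm(dim=-1)                        # (b, f, t)
        idx = scores.topk(k, dim=-1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, d)             # (b, f, k, d)
        refined = frame_tokens.gather(2, idx)                     # (b, f, k, d)

        # 2) Temporal model: flatten frames and model cross-frame relations.
        video_tokens = self.temporal(refined.reshape(b, f * k, d))

        # 3) Query: pool task-specific information with learnable queries.
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # (b, nq, d)
        out, _ = self.cross_attn(q, video_tokens, video_tokens)
        return out  # feed to a retrieval / QA / captioning head
```

The point of the sketch is the ordering: redundancy is removed before temporal modeling (so the transformer sees fewer, more informative tokens), and task-specific querying happens last, which is what lets one video representation serve retrieval, QA, and captioning heads.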

Results

Task                      Dataset      Metric              Value  Model
Video Retrieval           MSR-VTT-1kA  text-to-video R@1   53.4   RTQ
Video Retrieval           MSR-VTT-1kA  text-to-video R@5   76.1   RTQ
Video Retrieval           MSR-VTT-1kA  text-to-video R@10  84.4   RTQ
Video Retrieval           ActivityNet  text-to-video R@1   53.5   RTQ
Video Retrieval           ActivityNet  text-to-video R@5   81.4   RTQ
Video Retrieval           ActivityNet  text-to-video R@10  91.9   RTQ
Video Retrieval           DiDeMo       text-to-video R@1   57.6   RTQ
Video Retrieval           DiDeMo       text-to-video R@5   84.1   RTQ
Video Retrieval           DiDeMo       text-to-video R@10  89.9   RTQ
Video Question Answering  NExT-QA      Accuracy            63.2   RTQ
Video Captioning          MSR-VTT      BLEU-4              49.6   RTQ
Video Captioning          MSR-VTT      CIDEr               69.3   RTQ
Video Captioning          MSR-VTT      ROUGE-L             66.1   RTQ
Video Captioning          MSVD         BLEU-4              66.9   RTQ
Video Captioning          MSVD         CIDEr               123.4  RTQ
Video Captioning          MSVD         ROUGE-L             82.2   RTQ
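
The retrieval rows report text-to-video R@k: the percentage of text queries whose ground-truth video appears in the top k results when videos are ranked by similarity. A minimal sketch of that metric, assuming a square similarity matrix where text i matches video i:

```python
import numpy as np

def recall_at_k(sim, k):
    """sim: (num_texts, num_videos) similarity matrix; the ground-truth
    video for text i is assumed to be video i. Returns R@k in percent."""
    order = np.argsort(-sim, axis=1)  # videos ranked by descending similarity
    # Position of the ground-truth video in each query's ranking (0 = best).
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return 100.0 * np.mean(ranks < k)

# Example: 3 text queries vs. 3 videos.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.1, 0.7, 0.5]])
print(recall_at_k(sim, 1))  # 33.3... : only query 0's match ranks first
print(recall_at_k(sim, 2))  # 100.0  : every match is in the top 2
```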

Related Papers

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs (2025-06-27)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? (2025-06-19)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
CogStream: Context-guided Streaming Video Question Answering (2025-06-12)