Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

2022-12-31, CVPR 2023
Tasks: Action Classification, Attribute, Video Recognition, Zero-Shot Action Recognition, Action Recognition


Abstract

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE .
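
The abstract describes Temporal Concept Spotting as a parameter-free way to capture temporal saliency from Text-to-Video knowledge. The sketch below is one simplified reading of that idea, not the authors' implementation: it scores each frame embedding against a single text embedding by cosine similarity, turns the scores into weights with a softmax over time, and pools the frames with those weights. The function names, the single text embedding, and the `temperature` value are all assumptions made for illustration; no learned parameters are involved.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def saliency_pool(frame_embs, text_emb, temperature=0.01):
    """Hypothetical helper: weight frames by similarity to a text embedding.

    frame_embs: list of per-frame embedding vectors (one per sampled frame)
    text_emb:   one embedding vector for the category text
    Returns (weights, pooled), where weights sum to 1 over the frames.
    """
    text = l2_normalize(text_emb)
    frames = [l2_normalize(f) for f in frame_embs]
    # Cosine similarity of each frame to the text embedding.
    sims = [sum(a * b for a, b in zip(f, text)) for f in frames]
    # Softmax over the temporal axis gives a saliency weight per frame.
    weights = softmax(sims, temperature)
    # Saliency-weighted average of the frame embeddings.
    dim = len(text)
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return weights, pooled

# Toy example: three 2-D "frame embeddings" and one "text embedding".
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 0.0]
weights, pooled = saliency_pool(frames, text)
```

In this toy input the first frame points in the same direction as the text embedding, so it receives the largest weight and dominates the pooled video representation. A smaller `temperature` sharpens the weights toward the best-matching frames; a larger one approaches plain mean pooling.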

Results

Task	Dataset	Metric	Value	Model
Video	Charades	mAP	50.7	BIKE
Video	Kinetics-400	Acc@1	88.7	BIKE (CLIP ViT-L/14)
Video	Kinetics-400	Acc@5	98.4	BIKE (CLIP ViT-L/14)
Activity Recognition	HMDB-51	Average accuracy of 3 splits	83.1	BIKE
Activity Recognition	ActivityNet	mAP	96.1	BIKE
Activity Recognition	UCF101	3-fold Accuracy	98.8	BIKE
Action Recognition	HMDB-51	Average accuracy of 3 splits	83.1	BIKE
Action Recognition	ActivityNet	mAP	96.1	BIKE
Action Recognition	UCF101	3-fold Accuracy	98.8	BIKE
Zero-Shot Action Recognition	UCF101	Top-1 Accuracy	86.6	BIKE
Zero-Shot Action Recognition	Kinetics	Top-1 Accuracy	68.5	BIKE
Zero-Shot Action Recognition	Kinetics	Top-5 Accuracy	91.1	BIKE
Zero-Shot Action Recognition	HMDB51	Top-1 Accuracy	61.4	BIKE
Zero-Shot Action Recognition	ActivityNet	Top-1 Accuracy	86.2	BIKE
