Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400, Kinetics-600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
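The parameter-free temporal saliency idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the released BIKE implementation: it assumes per-frame embeddings and a single text embedding are already available from a CLIP-style encoder, and it simply weights frames by a softmax over their cosine similarity to the text. The function names (`cosine`, `softmax`, `saliency_pool`), the toy 2-D vectors, and the temperature value are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def saliency_pool(frame_embs, text_emb, temperature=0.1):
    """Pool frame embeddings using text-derived weights (no learned parameters).

    Assumption: the temperature of 0.1 is an illustrative value,
    not one taken from the paper.
    """
    sims = [cosine(f, text_emb) for f in frame_embs]
    weights = softmax(sims, temperature)
    dim = len(frame_embs[0])
    video_emb = [
        sum(w * f[d] for w, f in zip(weights, frame_embs)) for d in range(dim)
    ]
    return weights, video_emb

# Toy example: three "frames" and one "text" embedding in 2-D.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [1.0, 0.0]
weights, video = saliency_pool(frames, text)
```

In this toy setup the first frame, which is most aligned with the text vector, receives the largest weight, so the pooled video embedding leans toward it. The weights come entirely from similarity scores, which is the sense in which such pooling adds no trainable parameters.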
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | Charades | mAP | 50.7 | BIKE |
| Video | Kinetics-400 | Acc@1 | 88.7 | BIKE (CLIP ViT-L/14) |
| Video | Kinetics-400 | Acc@5 | 98.4 | BIKE (CLIP ViT-L/14) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 83.1 | BIKE |
| Activity Recognition | ActivityNet | mAP | 96.1 | BIKE |
| Activity Recognition | UCF101 | 3-fold Accuracy | 98.8 | BIKE |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 83.1 | BIKE |
| Action Recognition | ActivityNet | mAP | 96.1 | BIKE |
| Action Recognition | UCF101 | 3-fold Accuracy | 98.8 | BIKE |
| Zero-Shot Action Recognition | UCF101 | Top-1 Accuracy | 86.6 | BIKE |
| Zero-Shot Action Recognition | Kinetics | Top-1 Accuracy | 68.5 | BIKE |
| Zero-Shot Action Recognition | Kinetics | Top-5 Accuracy | 91.1 | BIKE |
| Zero-Shot Action Recognition | HMDB-51 | Top-1 Accuracy | 61.4 | BIKE |
| Zero-Shot Action Recognition | ActivityNet | Top-1 Accuracy | 86.2 | BIKE |