Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

2022-12-31 · CVPR 2023
Tasks: Action Classification, Attribute, Video Recognition, Zero-Shot Action Recognition, Action Recognition
Links: Paper · PDF · Code (official, plus additional implementations)

Abstract

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
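
To make the two mechanisms concrete, here is a minimal PyTorch sketch of one plausible reading of the abstract: Temporal Concept Spotting as a parameter-free, text-conditioned softmax weighting over frame embeddings, and Video Attribute Association as nearest-neighbor retrieval of attribute phrases in CLIP's text embedding space. All function names, signatures, the temperature value, and the attribute lexicon are illustrative assumptions, not the authors' implementation (see the official repository linked above).

```python
import torch
import torch.nn.functional as F

def temporal_concept_spotting(frame_feats, text_feat, temperature=0.07):
    # Text-to-Video direction (sketch): frames that align better with the
    # category text embedding receive larger weights; the video representation
    # is their weighted average. No learned parameters are involved, matching
    # the "parameter-free" claim in the abstract.
    # frame_feats: (T, D) per-frame embeddings from a frozen CLIP image encoder
    # text_feat:   (D,)   category embedding from the CLIP text encoder
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    saliency = frame_feats @ text_feat                      # (T,) cosine similarity
    weights = torch.softmax(saliency / temperature, dim=0)  # temporal saliency
    video_feat = (weights.unsqueeze(-1) * frame_feats).sum(dim=0)
    return F.normalize(video_feat, dim=-1), weights

def video_attribute_association(video_feat, attr_feats, attr_names, k=5):
    # Video-to-Text direction (sketch): retrieve the k attribute phrases whose
    # CLIP text embeddings are closest to the video embedding; these serve as
    # auxiliary textual cues complementing recognition. The attribute lexicon
    # (attr_feats, attr_names) is a hypothetical stand-in for however the
    # attribute vocabulary is actually built.
    sims = F.normalize(attr_feats, dim=-1) @ F.normalize(video_feat, dim=-1)
    topk = sims.topk(k).indices
    return [attr_names[i] for i in topk.tolist()]

# Toy usage with random tensors standing in for CLIP features:
T, D, N = 8, 512, 1000
video_repr, w = temporal_concept_spotting(torch.randn(T, D), torch.randn(D))
attrs = video_attribute_association(video_repr, torch.randn(N, D),
                                    [f"attr_{i}" for i in range(N)])
```

Because the frame weighting is just a softmax over cosine similarities to the class text, it can be dropped into any frozen-backbone pipeline without adding trainable parameters.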

Results

Task                          | Dataset      | Metric                        | Value | Model
Video                         | Charades     | MAP                           | 50.7  | BIKE
Video                         | Kinetics-400 | Acc@1                         | 88.7  | BIKE (CLIP ViT-L/14)
Video                         | Kinetics-400 | Acc@5                         | 98.4  | BIKE (CLIP ViT-L/14)
Activity Recognition          | HMDB-51      | Average accuracy of 3 splits  | 83.1  | BIKE
Activity Recognition          | ActivityNet  | mAP                           | 96.1  | BIKE
Activity Recognition          | UCF-101      | 3-fold Accuracy               | 98.8  | BIKE
Action Recognition            | HMDB-51      | Average accuracy of 3 splits  | 83.1  | BIKE
Action Recognition            | ActivityNet  | mAP                           | 96.1  | BIKE
Action Recognition            | UCF-101      | 3-fold Accuracy               | 98.8  | BIKE
Zero-Shot Action Recognition  | UCF-101      | Top-1 Accuracy                | 86.6  | BIKE
Zero-Shot Action Recognition  | Kinetics     | Top-1 Accuracy                | 68.5  | BIKE
Zero-Shot Action Recognition  | Kinetics     | Top-5 Accuracy                | 91.1  | BIKE
Zero-Shot Action Recognition  | HMDB-51      | Top-1 Accuracy                | 61.4  | BIKE
Zero-Shot Action Recognition  | ActivityNet  | Top-1 Accuracy                | 86.2  | BIKE

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Non-Adaptive Adversarial Face Generation (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Attributes Shape the Embedding Space of Face Recognition Models (2025-07-15)
COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation (2025-07-15)
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
Model Parallelism With Subnetwork Data Parallelism (2025-07-11)