Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Expanding Language-Image Pretrained Models for General Video Recognition

Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling

2022-08-04 · Zero-shot Generalization · Action Classification · Video Recognition · Zero-Shot Action Recognition · Action Recognition

Paper · PDF · Code (official)

Abstract

Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://aka.ms/X-CLIP
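The core idea of the cross-frame mechanism (attention that exchanges information across per-frame embeddings, added on top of frozen per-frame features via a residual connection) can be illustrated with a minimal single-head sketch. This is not the paper's implementation; all names, shapes, and the random weights are illustrative assumptions, written in NumPy for self-containedness:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(frame_feats, w_q, w_k, w_v):
    """Single-head attention over per-frame embeddings (hypothetical sketch).

    frame_feats: (T, D) array, one pooled embedding per video frame.
    Returns (T, D): each frame's embedding updated with information from
    all other frames; the residual keeps the pretrained features intact,
    mirroring the idea of a lightweight plug-in module.
    """
    q = frame_feats @ w_q                              # (T, D) queries
    k = frame_feats @ w_k                              # (T, D) keys
    v = frame_feats @ w_v                              # (T, D) values
    scores = q @ k.T / np.sqrt(frame_feats.shape[1])   # (T, T) frame-to-frame
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    return frame_feats + attn @ v                      # residual connection

# Toy usage: 8 frames with 16-dim features, random projection weights.
rng = np.random.default_rng(0)
T, D = 8, 16
feats = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))
out = cross_frame_attention(feats, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```

In this sketch, the attention matrix is (T, T), so the cost of exchanging temporal information scales with the number of frames rather than the number of spatial tokens, which is what makes such a module cheap to bolt onto a pretrained image backbone.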

Results

Task                         | Dataset      | Metric         | Value | Model
Video                        | Kinetics-400 | Acc@1          | 87.7  | X-CLIP (ViT-L/14, CLIP)
Video                        | Kinetics-400 | Acc@5          | 97.4  | X-CLIP (ViT-L/14, CLIP)
Video                        | Kinetics-600 | Top-1 Accuracy | 88.3  | X-CLIP (ViT-L/14, CLIP)
Video                        | Kinetics-600 | Top-5 Accuracy | 97.7  | X-CLIP (ViT-L/14, CLIP)
Zero-Shot Action Recognition | UCF101       | Top-1 Accuracy | 72    | X-CLIP
Zero-Shot Action Recognition | Kinetics     | Top-1 Accuracy | 65.2  | X-CLIP
Zero-Shot Action Recognition | Kinetics     | Top-5 Accuracy | 86.1  | X-CLIP
Zero-Shot Action Recognition | HMDB51       | Top-1 Accuracy | 44.6  | X-CLIP

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation (2025-07-15)
- PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment (2025-07-12)
- Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data (2025-07-09)
- Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025-07-08)
- Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach (2025-07-04)