Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ActionCLIP: A New Paradigm for Video Action Recognition

Mengmeng Wang, Jiazheng Xing, Yong Liu

2021-09-17 · Action Classification · Text Matching · Prompt Engineering · Zero-Shot Action Recognition · Action Recognition · Action Recognition In Videos · Temporal Action Localization

Paper · PDF · Code (official) · Code

Abstract

The canonical approach to video action recognition requires a neural model to perform a classic 1-of-N classification task. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model the task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to perform zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and exploit the vast amount of web data, we propose a new paradigm built on this multimodal framework, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on large-scale web image-text or video-text data. It then makes the action recognition task act more like the pre-training problem via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of this paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on general action recognition, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
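The zero-shot pipeline described above can be sketched in a few lines: embed the video, embed a prompted label text per class, and classify by similarity. This is a minimal, hypothetical illustration — the encoders below are deterministic stand-ins, not ActionCLIP's actual CLIP-based image/text transformers, and the prompt template is an assumed example of the prompt engineering step.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64

def encode_frames(frames):
    """Stand-in video encoder (assumption): project each frame to an
    embedding, mean-pool over time, and L2-normalize the result."""
    proj = rng.standard_normal((frames.shape[-1], EMBED_DIM))
    frame_emb = frames @ proj                  # (T, EMBED_DIM)
    video_emb = frame_emb.mean(axis=0)         # temporal pooling
    return video_emb / np.linalg.norm(video_emb)

def encode_texts(prompts):
    """Stand-in text encoder (assumption): hashed bag-of-words,
    L2-normalized per prompt."""
    embs = []
    for p in prompts:
        vec = np.zeros(EMBED_DIM)
        for tok in p.lower().split():
            vec[hash(tok) % EMBED_DIM] += 1.0
        embs.append(vec / (np.linalg.norm(vec) + 1e-8))
    return np.stack(embs)                      # (N, EMBED_DIM)

def zero_shot_classify(frames, class_names,
                       template="a video of a person {}"):
    """Score a video against prompted label texts by cosine similarity,
    then softmax over classes -- no labeled training data needed."""
    prompts = [template.format(c) for c in class_names]
    v = encode_frames(frames)                  # (EMBED_DIM,)
    t = encode_texts(prompts)                  # (N, EMBED_DIM)
    logits = 100.0 * (t @ v)                   # temperature-scaled scores
    expd = np.exp(logits - logits.max())
    return expd / expd.sum()

video = rng.standard_normal((8, 128))          # 8 frames of 128-dim features
classes = ["archery", "juggling", "swimming"]
probs = zero_shot_classify(video, classes)
print(probs)  # one probability per class, summing to 1
```

New classes can be added at inference time simply by extending `classes` — the model never needs retraining, which is the core advantage of matching against label text instead of fixed category indices.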

Results

Task                         | Dataset      | Metric         | Value | Model
Video                        | Charades     | mAP            | 44.3  | ActionCLIP (ViT-B/16)
Video                        | Kinetics-400 | Acc@1          | 83.8  | ActionCLIP (CLIP-pretrained)
Video                        | Kinetics-400 | Acc@5          | 97.1  | ActionCLIP (CLIP-pretrained)
Activity Recognition         | Kinetics-400 | Top-1 Accuracy | 83.8  | ActionCLIP (ViT-B/16)
Action Recognition           | Kinetics-400 | Top-1 Accuracy | 83.8  | ActionCLIP (ViT-B/16)
Action Recognition In Videos | Kinetics-400 | Top-1 Accuracy | 83.8  | ActionCLIP (ViT-B/16)

Related Papers

Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges (2025-07-13)
AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework (2025-07-08)
Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach (2025-07-04)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)