Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition

Anqi Zhu, Qiuhong Ke, Mingming Gong, James Bailey

2024-06-19 · CVPR 2024
Tasks: Skeleton Based Action Recognition · Zero Shot Skeletal Action Recognition · Zero-Shot Action Recognition · Action Recognition · Zero-Shot Learning
Paper · PDF · Code (official)

Abstract

While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset, Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source code can be accessed at https://github.com/azzh1/PURLS.
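To make the two-module design in the abstract concrete, the sketch below approximates the idea in PyTorch. It is an illustrative reading of the abstract, not the authors' implementation (see the official repository linked above): it assumes cross-attention pooling as the "adaptive sampling" that groups joint features under each textual description, and a mean of cosine similarities across global and part/interval descriptions as the zero-shot score. All class names, dimensions, and the single-query attention are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAwareAlignment(nn.Module):
    """Sketch: align skeleton joint features with description embeddings."""

    def __init__(self, joint_dim: int = 256, text_dim: int = 512, embed_dim: int = 256):
        super().__init__()
        self.joint_proj = nn.Linear(joint_dim, embed_dim)  # project per-joint features
        self.text_proj = nn.Linear(text_dim, embed_dim)    # project text embeddings
        self.scale = embed_dim ** -0.5

    def pool_by_description(self, joints: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        """Pool the joint features most relevant to one description.

        joints: (N, J, joint_dim) per-joint visual features
        text:   (N, text_dim)     one global/body-part/interval description embedding
        returns (N, embed_dim)    description-conditioned visual representation
        """
        q = self.text_proj(text).unsqueeze(1)   # (N, 1, D) query from the description
        k = v = self.joint_proj(joints)         # (N, J, D) keys/values from joints
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (N, 1, J)
        return (attn @ v).squeeze(1)            # attention-weighted sum over joints

def zero_shot_predict(model: PartAwareAlignment,
                      joints: torch.Tensor,
                      class_text_embs: torch.Tensor) -> torch.Tensor:
    """Score skeleton sequences against every unseen class.

    class_text_embs: (C, K, text_dim), K description embeddings per class
    (e.g. one global plus several body-part/temporal ones from an LLM prompt).
    """
    scores = []
    for c in range(class_text_embs.size(0)):
        per_desc = []
        for k in range(class_text_embs.size(1)):
            text = class_text_embs[c, k].expand(joints.size(0), -1)  # broadcast to batch
            vis = model.pool_by_description(joints, text)            # (N, D)
            txt = model.text_proj(class_text_embs[c, k])             # (D,)
            per_desc.append(F.cosine_similarity(vis, txt.unsqueeze(0), dim=-1))
        scores.append(torch.stack(per_desc, dim=-1).mean(dim=-1))    # average the scales
    return torch.stack(scores, dim=-1).argmax(dim=-1)                # (N,) predicted ids

In this sketch, each unseen class carries K description embeddings, and a sequence is assigned to the class whose descriptions best match its pooled skeleton features; the paper's actual partitioning, prompting, and training objectives differ in detail.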

Results

The archive lists the same four results under each of the following task tags: Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, 3D Action Recognition, and Action Recognition. The distinct results are:

Dataset       | Metric                         | Value | Model
NTU RGB+D 120 | Accuracy (10 unseen classes)   | 71.95 | PURLS
NTU RGB+D 120 | Accuracy (24 unseen classes)   | 52.01 | PURLS
NTU RGB+D     | Accuracy (12 unseen classes)   | 40.99 | PURLS
NTU RGB+D     | Accuracy (5 unseen classes)    | 79.23 | PURLS

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation (2025-07-14)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning (2025-06-26)
Zero-Shot Learning for Obsolescence Risk Forecasting (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)