Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders

Sheng-Wei Li, Zi-Xiang Wei, Wei-Jie Chen, Yi-Hsin Yu, Chih-Yuan Yang, Jane Yung-jen Hsu

Date: 2024-07-18
Tasks: Skeleton Based Action Recognition, Disentanglement, Zero Shot Skeletal Action Recognition, Generalized Zero Shot skeletal action recognition, Action Recognition

Abstract

Existing zero-shot skeleton-based action recognition methods use projection networks to learn a shared latent space of skeleton features and semantic embeddings. Action recognition datasets are inherently imbalanced: skeleton sequences vary widely while class labels stay constant, which makes alignment difficult. To address this imbalance, we propose SA-DVAE (Semantic Alignment via Disentangled Variational Autoencoders), a method that first disentangles skeleton features into two independent parts, one semantic-related and the other semantic-irrelevant, to better align skeleton and semantic features. We implement this idea with a pair of modality-specific variational autoencoders coupled with a total correlation penalty. We conduct experiments on three benchmark datasets, NTU RGB+D, NTU RGB+D 120, and PKU-MMD, and our results show that SA-DVAE improves on existing methods. The code is available at https://github.com/pha123661/SA-DVAE.
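The disentanglement idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that): the function names, latent sizes, and numeric values are hypothetical, the squared-distance alignment term is only a placeholder, and the total correlation penalty is omitted. It only shows a skeleton latent being split into a semantic-related part, which is compared with a text latent, and a semantic-irrelevant part, which is not.

```python
import math
import random


def split_latent(mu, logvar, semantic_dim):
    """Split diagonal-Gaussian parameters into (semantic-related, semantic-irrelevant)."""
    related = (mu[:semantic_dim], logvar[:semantic_dim])
    irrelevant = (mu[semantic_dim:], logvar[semantic_dim:])
    return related, irrelevant


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (the standard VAE trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def squared_distance(a, b):
    """Placeholder alignment term (an assumption, not the paper's loss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical encoder outputs: a 6-d skeleton latent and a 3-d text latent.
rng = random.Random(0)
skel_mu = [0.5, -0.2, 0.1, 0.9, -1.0, 0.3]
skel_logvar = [-1.0, -1.0, -1.0, 0.0, 0.0, 0.0]
text_mu = [0.4, -0.1, 0.0]

(rel_mu, rel_logvar), (irr_mu, irr_logvar) = split_latent(skel_mu, skel_logvar, semantic_dim=3)

z_related = reparameterize(rel_mu, rel_logvar, rng)      # used for alignment with text
z_irrelevant = reparameterize(irr_mu, irr_logvar, rng)   # kept only for reconstruction

align_loss = squared_distance(rel_mu, text_mu)           # only the related part is aligned
kl_loss = kl_to_standard_normal(skel_mu, skel_logvar)
print(len(z_related), len(z_irrelevant), align_loss, kl_loss)
```

Only the semantic-related slice enters the alignment term, so variation captured by the irrelevant slice does not have to match the text embedding.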

Results

| Dataset | Metric | Value | Model |
|---|---|---|---|
| NTU RGB+D 120 | Accuracy (10 unseen classes) | 68.77 | SA-DVAE |
| NTU RGB+D 120 | Accuracy (24 unseen classes) | 46.12 | SA-DVAE |
| NTU RGB+D 120 | Random Split Accuracy | 50.67 | SA-DVAE |
| NTU RGB+D 120 | Random Split Accuracy | 57.16 | SA-DVAE + augmented text |
| PKU-MMD | Random Split Accuracy | 66.54 | SA-DVAE |
| NTU RGB+D | Accuracy (12 unseen classes) | 41.38 | SA-DVAE |
| NTU RGB+D | Accuracy (5 unseen classes) | 82.37 | SA-DVAE |
| NTU RGB+D | Random Split Accuracy | 84.2 | SA-DVAE |
| NTU RGB+D | Random Split Accuracy | 87.61 | SA-DVAE + augmented text |
| NTU RGB+D | Harmonic Mean (12 unseen classes) | 42.56 | SA-DVAE |
| NTU RGB+D | Harmonic Mean (5 unseen classes) | 66.27 | SA-DVAE |
| NTU RGB+D | Random Split Harmonic Mean | 75.27 | SA-DVAE |
| NTU RGB+D | Random Split Harmonic Mean | 75.51 | SA-DVAE + augmented text |
| NTU RGB+D 120 | Harmonic Mean (10 unseen classes) | 60.42 | SA-DVAE |
| NTU RGB+D 120 | Harmonic Mean (24 unseen classes) | 44.5 | SA-DVAE |
| NTU RGB+D 120 | Random Split Harmonic Mean | 47.54 | SA-DVAE |
| NTU RGB+D 120 | Random Split Harmonic Mean | 50.72 | SA-DVAE + augmented text |
| PKU-MMD | Random Split Harmonic Mean | 54.72 | SA-DVAE |

The same results are listed under each of the following tasks: Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, 3D Action Recognition, Action Recognition.
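
In generalized zero-shot learning, a "Harmonic Mean" score conventionally combines the accuracy on seen classes with the accuracy on unseen classes. Assuming the Harmonic Mean rows above follow that convention, the computation can be sketched as below. The seen and unseen accuracies in the example are made-up inputs, since this page does not report them separately.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen-class and unseen-class accuracy (both in percent)."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies, for illustration only.
print(round(harmonic_mean(80.0, 60.0), 2))  # 68.57
```

The harmonic mean is pulled toward the lower of the two accuracies, so a model cannot score well by favoring seen classes alone.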

Related Papers

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models (2025-07-18)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Towards Imperceptible JPEG Image Hiding: Multi-range Representations-driven Adversarial Stego Generation (2025-07-11)
Generative Head-Mounted Camera Captures for Photorealistic Avatars (2025-07-08)
Reflections Unlock: Geometry-Aware Reflection Disentanglement in 3D Gaussian Splatting for Photorealistic Scenes Rendering (2025-07-08)
Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations (2025-07-04)
Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation (2025-07-04)
Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization (2025-07-03)