Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition

Jeonghyeok Do, Munchurl Kim

Date: 2024-11-16
Tasks: Text Matching, Skeleton Based Action Recognition, Zero Shot Skeletal Action Recognition, Zero-Shot Action Recognition, Action Recognition, Zero-Shot Learning

Abstract

We present the first diffusion-based method for zero-shot action recognition from skeleton inputs. In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Previous methods focus on direct alignment between the skeleton and text latent spaces, but the modality gap between these spaces hinders robust generalization. Motivated by the remarkable performance of text-to-image diffusion models, we leverage their ability to align different modalities, focusing on the training process of reverse diffusion rather than on their generative power. Based on this, we design our framework as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method, which aligns skeleton features with text prompts through reverse diffusion, embedding the prompts into a unified skeleton-text latent space to achieve robust matching. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages TDSM to favor correct skeleton-text matches while pushing apart incorrect ones. TDSM outperforms very recent state-of-the-art methods by large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
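The abstract describes the TD loss only in words, so the following is a minimal sketch of one way such a loss could look, not the authors' formulation. It assumes a hinge-style triplet objective over denoising errors: the error of the noise predicted under the matching text prompt should be lower, by a margin, than the error under a non-matching prompt. The function names, the mean-squared-error choice, and the `margin` parameter are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def triplet_diffusion_loss(noise, pred_pos, pred_neg, margin=1.0):
    """Hinge-style triplet loss on denoising errors (illustrative assumption).

    noise:    the noise that was added to a skeleton latent
    pred_pos: noise predicted when conditioned on the matching text prompt
    pred_neg: noise predicted when conditioned on a non-matching text prompt
    margin:   hypothetical margin hyperparameter
    """
    err_pos = mse(noise, pred_pos)  # should be small for a correct match
    err_neg = mse(noise, pred_neg)  # should be large for an incorrect match
    return max(err_pos - err_neg + margin, 0.0)


# Toy values: one prediction reproduces the noise, the other is off by 2.0 everywhere.
noise = [0.5, -1.0, 0.25, 2.0]
accurate = [0.5, -1.0, 0.25, 2.0]
inaccurate = [2.5, 1.0, 2.25, 4.0]

# Matching prompt denoises well, non-matching prompt denoises badly: no penalty.
good_loss = triplet_diffusion_loss(noise, accurate, inaccurate)
# Roles swapped: the matching prompt denoises worse, so the loss is positive.
bad_loss = triplet_diffusion_loss(noise, inaccurate, accurate)
```

In this toy case the loss is zero when the matching prompt yields the lower denoising error by more than the margin, and positive when the roles are swapped, which is the "favor correct matches, push apart incorrect ones" behavior the abstract describes.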

Results

The same seven results are listed under each of the following tasks: Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, 3D Action Recognition, Action Recognition.

Dataset        Metric                        Value  Model
NTU RGB+D 120  Accuracy (10 unseen classes)  74.15  TDSM
NTU RGB+D 120  Accuracy (24 unseen classes)  65.06  TDSM
NTU RGB+D 120  Random Split Accuracy         69.47  TDSM
PKU-MMD        Random Split Accuracy         70.76  TDSM
NTU RGB+D      Accuracy (12 unseen classes)  56.03  TDSM
NTU RGB+D      Accuracy (5 unseen classes)   86.49  TDSM
NTU RGB+D      Random Split Accuracy         88.88  TDSM
