
SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

Jeonghyeok Do, Munchurl Kim

2024-03-14 · Tasks: Skeleton Based Action Recognition, Action Recognition, Human Interaction Recognition
Paper · PDF · Code (official)

Abstract

Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
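
The sketch below illustrates the partition-specific attention idea described above: frames are split into (window, offset) pairs and joints into (body part, offset) pairs, and self-attention is applied within each of the four resulting skeletal-temporal partition types. It is a minimal illustration, not the authors' implementation; the tensor shapes, the contiguous-index joint grouping, and the partition_attention helper are assumptions made for this example (see the official code linked above for the actual Skate-MSA block).

import torch


def attend(x):
    # Scaled dot-product self-attention over the token dimension.
    # x: (..., N, C); attention is computed among the N tokens of each group.
    scale = x.shape[-1] ** -0.5
    attn = torch.softmax(x @ x.transpose(-2, -1) * scale, dim=-1)
    return attn @ x


def skate_msa_sketch(x, t_window=4, v_parts=5):
    # x: (B, T, V, C); T must be divisible by t_window, V by v_parts.
    # Frames are split into (window, offset) and joints into (part, offset):
    # attending over offsets gives "neighboring" relations, attending over
    # windows/parts gives "distant" (strided) relations.
    B, T, V, C = x.shape
    nw, ps = T // t_window, V // v_parts  # number of windows, joints per part
    y = x.reshape(B, nw, t_window, v_parts, ps, C)  # (B, win, off_t, part, off_v, C)

    def partition_attention(token_dims, group_dims):
        # Put the two grouping dims first and the two token dims next,
        # flatten each pair, and attend within every group independently.
        z = y.permute(0, *group_dims, *token_dims, 5)
        z = z.reshape(B, z.shape[1] * z.shape[2], z.shape[3] * z.shape[4], C)
        return attend(z)

    return {
        # (i) neighboring frames x neighboring joints
        "near-T/near-V": partition_attention(token_dims=(2, 4), group_dims=(1, 3)),
        # (ii) neighboring frames x distant joints (one joint per part)
        "near-T/far-V": partition_attention(token_dims=(2, 3), group_dims=(1, 4)),
        # (iii) distant (strided) frames x neighboring joints
        "far-T/near-V": partition_attention(token_dims=(1, 4), group_dims=(2, 3)),
        # (iv) distant frames x distant joints
        "far-T/far-V": partition_attention(token_dims=(1, 3), group_dims=(2, 4)),
    }


if __name__ == "__main__":
    x = torch.randn(2, 64, 25, 32)  # 64 frames, 25 joints (NTU-style skeleton), 32 channels
    for name, out in skate_msa_sketch(x).items():
        print(name, tuple(out.shape))

Restricting attention to each partition keeps every attention matrix small (on the order of (t_window * ps)^2 entries rather than (T * V)^2), which is the memory saving the abstract refers to.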

Results

Task | Dataset | Metric | Value | Model
Human Interaction Recognition | NTU RGB+D | Accuracy (Cross-Subject) | 97.1 | SkateFormer
Human Interaction Recognition | NTU RGB+D | Accuracy (Cross-View) | 99.3 | SkateFormer
Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 93.2 | SkateFormer
Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 92.3 | SkateFormer
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.4 | SkateFormer
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.8 | SkateFormer
Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | SkateFormer
Action Recognition | N-UCLA | Accuracy | 98.3 | SkateFormer
Action Recognition | NTU RGB+D | Accuracy (Cross-Subject) | 93.5 | SkateFormer
Action Recognition | NTU RGB+D | Accuracy (Cross-View) | 97.8 | SkateFormer
Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | SkateFormer
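
The "Ensembled Modalities" rows indicate that the reported accuracies come from ensembling class scores over four input representations of the skeleton sequence. A common convention in skeleton-based action recognition is a late fusion of joint, bone, joint-motion, and bone-motion streams; the sketch below assumes that convention and uses placeholder score arrays, since the exact modalities and weights are not stated on this page.

import numpy as np

rng = np.random.default_rng(0)
num_samples, num_classes = 8, 120  # e.g. NTU RGB+D 120

# Placeholder per-stream class scores; in practice these would be the softmax
# outputs of four separately trained streams (the joint / bone / joint-motion /
# bone-motion split is an assumed convention, not taken from the paper page).
streams = {name: rng.random((num_samples, num_classes))
           for name in ("joint", "bone", "joint_motion", "bone_motion")}

weights = {name: 1.0 for name in streams}  # equal weights; often tuned per dataset split
ensemble = sum(weights[n] * s for n, s in streams.items())
predictions = ensemble.argmax(axis=1)
print(predictions)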

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)