Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Hypergraph Transformer for Skeleton-based Action Recognition

Yuxuan Zhou, Zhi-Qi Cheng, Chao Li, Yanwen Fang, Yifeng Geng, Xuansong Xie, Margret Keuper

2022-11-17 | Skeleton Based Action Recognition | Action Recognition
Paper | PDF | Code (official)

Abstract

Skeleton-based action recognition aims to recognize human actions given human joint coordinates with skeletal interconnections. By defining a graph with joints as vertices and their natural connections as edges, previous works successfully adopted Graph Convolutional Networks (GCNs) to model joint co-occurrences and achieved superior performance. More recently, a limitation of GCNs was identified: the topology is fixed after training. To relax this restriction, the Self-Attention (SA) mechanism has been adopted to make the topology of GCNs adaptive to the input, resulting in state-of-the-art hybrid models. Concurrently, attempts with plain Transformers have also been made, but they still lag behind state-of-the-art GCN-based methods due to the lack of a structural prior. Unlike hybrid models, we propose a more elegant solution that incorporates bone connectivity into the Transformer via a graph distance embedding. Our embedding retains the information of the skeletal structure during training, whereas GCNs merely use it for initialization. More importantly, we reveal an underlying issue of graph models in general: pairwise aggregation essentially ignores the high-order kinematic dependencies between body joints. To fill this gap, we propose a new self-attention mechanism on the hypergraph, termed Hypergraph Self-Attention (HyperSA), to incorporate intrinsic higher-order relations into the model. We name the resulting model Hyperformer; it beats state-of-the-art graph models in both accuracy and efficiency on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.
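The abstract's two key ideas can be sketched loosely in plain Python: attention logits shifted by a bias table indexed with the skeletal graph distance between joints, and a crude hyperedge-pooling step standing in for HyperSA's higher-order aggregation. This is an illustrative sketch, not the authors' implementation; the function names, the distance-bias values, and the body-part hyperedges below are all assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def graph_distance_attention(q, k, v, dist, dist_bias):
    """Single-head attention where the logit for joint pair (i, j) is
    shifted by a bias indexed with the skeletal graph distance.
    q, k, v: lists of per-joint feature vectors;
    dist[i][j]: shortest-path hop count on the skeleton graph;
    dist_bias[d]: scalar bias for hop distance d (learnable in the paper,
    a fixed table here -- hypothetical values).
    """
    d_model = len(q[0])
    out = []
    for i in range(len(q)):
        logits = []
        for j in range(len(k)):
            dot = sum(a * b for a, b in zip(q[i], k[j]))
            logits.append(dot / math.sqrt(d_model) + dist_bias[dist[i][j]])
        w = softmax(logits)
        # Weighted sum of value vectors for joint i.
        out.append([sum(w[j] * v[j][t] for j in range(len(v)))
                    for t in range(d_model)])
    return out

def hyperedge_pool(feats, hyperedges):
    """Average joint features within each hyperedge (e.g. a body part):
    a crude stand-in for the higher-order aggregation HyperSA performs."""
    d = len(feats[0])
    return [[sum(feats[j][t] for j in e) / len(e) for t in range(d)]
            for e in hyperedges]
```

In the paper, the distance bias would be a learnable embedding retained throughout training, and HyperSA attends over hyperedges rather than simply averaging them; the sketch only conveys how a structural prior can enter the attention logits instead of being baked into a fixed graph topology.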

Results

Task | Dataset | Metric | Value | Model
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.3 | Hyperformer
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 89.9 | Hyperformer
Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 4 | Hyperformer
Action Recognition | NTU RGB+D | Accuracy (Cross-Subject) | 92.9 | Hyperformer
Action Recognition | NTU RGB+D | Accuracy (Cross-View) | 96.5 | Hyperformer
Action Recognition | NTU RGB+D | Ensembled Modalities | 4 | Hyperformer

The same results are also listed under the task tags Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, Action Detection, and 3D Action Recognition.

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)