Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition

Nguyen Huu Bao Long

2023-12-06 · Skeleton Based Action Recognition · Action Recognition
Paper · PDF · Code (official)

Abstract

Graph convolutional networks (GCNs) have been widely used and have achieved remarkable results in skeleton-based action recognition. We regard the key to skeleton-based action recognition as the skeleton evolving across frames, so we focus on how graph convolutional networks learn different topologies and effectively aggregate joint features at global and local temporal scales. In this work, we propose three Channel-wise Topology Graph Convolution modules based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules captures skeleton features relating the upper and lower body parts and the hands and feet. We then design Temporal Attention Transformers that learn the temporal features of human skeleton sequences, capturing how skeletons change across frames. Finally, we fuse the multi-scale temporal feature outputs with an MLP for classification. We develop a powerful graph convolutional network named Spatial-Temporal Effective Body-part Cross Attention Transformer (STEP-CATFormer), which achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer
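The core idea the abstract borrows from CTR-GCN is that each output channel aggregates joints with its own refined topology: a shared skeleton adjacency plus a learned, channel-specific correction. A minimal NumPy sketch of that idea (not the authors' implementation; the function name `ctr_gc` and all shapes here are illustrative assumptions):

```python
import numpy as np

def ctr_gc(x, A, refine, W):
    """Channel-wise topology refinement graph convolution (illustrative sketch).

    x:      (C_in, T, V)  joint features over C_in channels, T frames, V joints
    A:      (V, V)        shared skeleton adjacency
    refine: (C_out, V, V) learned channel-specific topology refinement
    W:      (C_out, C_in) pointwise feature transform
    """
    # pointwise feature transform -> (C_out, T, V)
    y = np.einsum('oc,ctv->otv', W, x)
    # each output channel o aggregates joints with its own topology A + refine[o]
    out = np.einsum('otv,ovu->otu', y, A[None] + refine)
    return out
```

With `refine` set to zero every channel falls back to the shared adjacency `A`; training the refinement lets different channels attend to different joint relations, which is what "learn different topologies" refers to in the abstract.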

Results

The same six results are listed verbatim under multiple task tags (Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, Action Detection, 3D Action Recognition, and Action Recognition); deduplicated:

Dataset        | Metric                   | Value | Model
NTU RGB+D 120  | Accuracy (Cross-Setup)   | 91.2  | STEP-CATFormer
NTU RGB+D 120  | Accuracy (Cross-Subject) | 90.0  | STEP-CATFormer
NTU RGB+D 120  | Ensembled Modalities     | 4     | STEP-CATFormer
NTU RGB+D      | Accuracy (Cross-Subject) | 93.2  | STEP-CATFormer
NTU RGB+D      | Accuracy (Cross-View)    | 97.3  | STEP-CATFormer
NTU RGB+D      | Ensembled Modalities     | 4     | STEP-CATFormer
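"Ensembled Modalities: 4" in skeleton-based action recognition typically means late score fusion over four input streams (commonly joint, bone, joint motion, and bone motion, though the page does not name them). A hypothetical sketch of such fusion; the function name and weighting scheme are assumptions, not taken from the paper:

```python
import numpy as np

def ensemble_scores(scores, weights=None):
    """Late score fusion across modality streams (illustrative sketch).

    scores:  list of (N, num_classes) score arrays, one per modality
    weights: optional per-modality fusion weights (defaults to uniform)
    """
    if weights is None:
        weights = [1.0] * len(scores)
    # weighted sum of per-modality class scores, then argmax per sample
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused.argmax(axis=1)
```

Each stream is trained separately and only the class scores are combined at test time, which is why the per-dataset accuracy rows above report the ensembled result rather than any single stream.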

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)