Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition

Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu

2020-07-07 · Skeleton Based Action Recognition · Action Recognition · Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Dynamic skeletal data, represented as the 2D/3D coordinates of human joints, has been widely studied for human action recognition due to its high-level semantic information and environmental robustness. However, previous methods rely heavily on hand-crafted traversal rules or graph topologies to model dependencies between the joints, which limits both performance and generalizability. In this work, we present a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It consists solely of attention blocks, allowing spatial-temporal dependencies between joints to be modeled without knowledge of their positions or mutual connections. Specifically, to meet the particular requirements of skeletal data, three techniques are proposed for building the attention blocks: spatial-temporal attention decoupling, decoupled position encoding, and spatial global regularization. In addition, on the data side, we introduce a skeletal data decoupling technique that emphasizes the specific characteristics of space/time and different motion scales, yielding a more comprehensive understanding of human actions. To test the effectiveness of the proposed method, extensive experiments are conducted on four challenging datasets for skeleton-based gesture and action recognition, namely SHREC, DHG, NTU-60 and NTU-120, where DSTA-Net achieves state-of-the-art performance on all of them.
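The core idea in the abstract — separate attention over joints (space) and over frames (time), plus a "motion" stream obtained by decoupling the data — can be sketched as below. This is an illustrative NumPy toy under assumed shapes and a single attention head with random weights; it is not the paper's implementation, and all names (`self_attention`, the 25-joint/16-channel dimensions) are chosen for demonstration only.

```python
import numpy as np

def softmax(s, axis=-1):
    # Numerically stable softmax
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the rows of x (N, C)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return scores @ v

rng = np.random.default_rng(0)
T, V, C = 8, 25, 16  # frames, joints, channels; 25 joints as in NTU RGB+D (assumed sizes)
x = rng.standard_normal((T, V, C))
wq, wk, wv = [0.1 * rng.standard_normal((C, C)) for _ in range(3)]

# Spatial attention: each frame attends over its V joints — no predefined
# graph topology or joint ordering is required
spatial = np.stack([self_attention(x[t], wq, wk, wv) for t in range(T)])

# Temporal attention: each joint attends over the T frames, decoupled from space
temporal = np.stack([self_attention(x[:, v], wq, wk, wv) for v in range(V)], axis=1)

# Data decoupling (motion scale): frame-to-frame differences form a "motion" stream
motion = np.diff(x, axis=0, prepend=x[:1])

assert spatial.shape == temporal.shape == motion.shape == (T, V, C)
```

Decoupling keeps each attention map small — V×V for space and T×T for time — instead of one (T·V)×(T·V) joint map, which is the efficiency argument behind separating the two dimensions.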

Results

Task                         | Dataset   | Metric        | Value | Model
-----------------------------|-----------|---------------|-------|---------
Video                        | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
Video                        | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net
Temporal Action Localization | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
Temporal Action Localization | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net
Zero-Shot Learning           | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
Zero-Shot Learning           | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net
Activity Recognition         | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
Activity Recognition         | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net
Action Localization          | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
Action Localization          | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net
Action Detection             | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
Action Detection             | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net
3D Action Recognition        | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
3D Action Recognition        | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net
Action Recognition           | NTU RGB+D | Accuracy (CS) | 91.5  | DSTA-Net
Action Recognition           | NTU RGB+D | Accuracy (CV) | 96.4  | DSTA-Net

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)