Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Language Knowledge-Assisted Representation Learning for Skeleton-Based Action Recognition

Haojun Xu, Yan Gao, Zheng Hui, Jie Li, Xinbo Gao

2023-05-21 · GPR · Representation Learning · Skeleton Based Action Recognition · Action Recognition

Paper · PDF · Code (official)

Abstract

How humans understand and recognize the actions of others is a complex neuroscientific problem involving a combination of cognitive mechanisms and neural networks. Research has shown that the brain areas involved in recognizing actions process top-down attentional information, such as the temporoparietal association area. Humans also have brain regions dedicated to understanding the minds of others and analyzing their intentions, such as the medial prefrontal cortex of the temporal lobe. Skeleton-based action recognition maps the complex connections between human skeleton movement patterns and behaviors. Although existing studies encode meaningful node relationships and synthesize action representations for classification with good results, few consider incorporating a priori knowledge to aid representation learning for better performance. This paper proposes LA-GCN, a graph convolution network assisted by knowledge from large language models (LLMs). First, the LLM knowledge is mapped into an a priori global relationship (GPR) topology and an a priori category relationship (CPR) topology between nodes. The GPR guides the generation of new "bone" representations, aiming to emphasize essential node information at the data level. The CPR mapping simulates category prior knowledge in human brain regions; it is encoded by the PC-AC module and used to add auxiliary supervision, forcing the model to learn class-distinguishable features. In addition, to improve the efficiency of information transfer in topology modeling, we propose multi-hop attention graph convolution, which aggregates each node's k-order neighbors simultaneously to speed up model convergence. LA-GCN achieves state-of-the-art results on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
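The multi-hop aggregation idea in the abstract (each layer pulls in a node's 1..k-order neighbors simultaneously, rather than one hop per layer) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the `hop_weights` here are fixed placeholders standing in for the learned attention scores of the paper's multi-hop attention module, and the function names are hypothetical.

```python
import numpy as np

def k_hop_masks(A, K):
    # A: (n, n) binary adjacency matrix without self-loops.
    # Returns masks M_1..M_K where M_k[i, j] = 1 iff the shortest
    # path from node i to node j has length exactly k.
    n = A.shape[0]
    reached = np.eye(n, dtype=bool)   # nodes already accounted for
    power = np.eye(n, dtype=int)
    masks = []
    for _ in range(K):
        power = power @ A                  # counts walks of length k
        hop = (power > 0) & ~reached       # nodes first reached at hop k
        masks.append(hop.astype(float))
        reached |= hop
    return masks

def multi_hop_layer(X, A, K=2, hop_weights=None):
    # One propagation step that aggregates all 1..K-hop neighborhoods
    # at once; uniform hop_weights are a placeholder for learned
    # per-hop attention.
    masks = k_hop_masks(A, K)
    if hop_weights is None:
        hop_weights = np.ones(K) / K
    out = X.copy()
    for w, M in zip(hop_weights, masks):
        deg = M.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0                # avoid dividing by zero
        out = out + w * (M / deg) @ X      # mean-pool each hop's neighbors
    return out
```

With K=2, a node's features reach its 2-hop neighbors after a single layer, which is the convergence-speed argument the abstract makes for aggregating k-order neighborhoods simultaneously.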

Results

Task: Action Recognition (also listed under: Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, Action Detection, 3D Action Recognition)

Dataset         Metric                    Value   Model
NTU RGB+D 120   Accuracy (Cross-Setup)    91.8    LA-GCN
NTU RGB+D 120   Accuracy (Cross-Subject)  90.7    LA-GCN
NTU RGB+D 120   Ensembled Modalities      6       LA-GCN
N-UCLA          Accuracy                  97.6    LA-GCN
NTU RGB+D       Accuracy (Cross-Subject)  93.5    LA-GCN
NTU RGB+D       Accuracy (Cross-View)     97.2    LA-GCN
NTU RGB+D       Ensembled Modalities      6       LA-GCN

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Sample-Constrained Black Box Optimization for Audio Personalization (2025-07-17)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)