Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition

Jinfu Liu, Chen Chen, Mengyuan Liu

2024-07-22 · Skeleton Based Action Recognition · Contrastive Learning · Action Recognition
Paper · PDF · Code (official)

Abstract

Skeleton-based action recognition has garnered significant attention due to the concise and resilient nature of skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient because they use multimodal data in both the training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework that leverages multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition: it engages in multi-modality co-learning during training and maintains efficiency by employing only concise skeletons at inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instructions to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features further refine the classification scores, and the refined scores enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA benchmarks consistently verify the effectiveness of MMCL, which outperforms existing skeleton-based action recognition methods. Meanwhile, experiments on the UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
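The contrastive alignment in the FAM can be pictured as a standard symmetric InfoNCE objective over paired RGB and skeleton embeddings. The sketch below is illustrative only: the function name, feature shapes, and temperature are assumptions, not the authors' implementation (their code is at the repository linked above).

```python
import numpy as np

def info_nce_alignment(rgb_feats, skel_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired RGB and skeleton embeddings.

    rgb_feats, skel_feats: (batch, dim) arrays; row i of each is a matched pair.
    Returns a scalar loss (lower = better aligned). Hypothetical sketch only.
    """
    # L2-normalize so dot products become cosine similarities.
    rgb = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    skel = skel_feats / np.linalg.norm(skel_feats, axis=1, keepdims=True)

    logits = rgb @ skel.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(logits))              # positives sit on the diagonal

    def xent(lg):
        # Cross-entropy with the diagonal as targets, numerically stabilized.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric loss: RGB->skeleton and skeleton->RGB retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly aligned pairs should score a lower loss than random pairings.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned = info_nce_alignment(x, x)
mismatched = info_nce_alignment(x, rng.normal(size=(8, 16)))
```

Minimizing this loss pulls each skeleton embedding toward its own clip's RGB embedding and away from the other clips in the batch, which is the usual mechanism by which contrastive co-learning transfers RGB detail into the skeleton branch.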

Results

Task | Dataset | Metric | Value | Model
Action Recognition | NTU RGB+D | Accuracy (Cross-Subject) | 93.5 | MMCL
Action Recognition | NTU RGB+D | Accuracy (Cross-View) | 97.4 | MMCL
Action Recognition | NTU RGB+D | Ensembled Modalities | 6 | MMCL
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.3 | MMCL
Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.7 | MMCL
Action Recognition | NTU RGB+D 120 | Ensembled Modalities | 6 | MMCL
Action Recognition | N-UCLA | Accuracy | 97.5 | MMCL
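The abstract's soft-label-style score refinement in the FRM can be sketched as blending the skeleton classifier's class distribution with a distribution derived from the multimodal LLM's instructive text features. The blending weight, names, and shapes below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def refine_scores(cls_logits, text_logits, alpha=0.8):
    """Blend classifier scores with instructive text scores, soft-label style.

    cls_logits:  (batch, classes) logits from the skeleton branch
    text_logits: (batch, classes) scores derived from LLM text features
    alpha:       weight on the original classifier (hypothetical value)
    Returns a refined (batch, classes) probability distribution.
    """
    return alpha * softmax(cls_logits) + (1 - alpha) * softmax(text_logits)

cls = np.array([[2.0, 1.0, 0.1]])   # skeleton branch favors class 0
txt = np.array([[1.5, 1.4, 0.2]])   # text scores agree, but more softly
out = refine_scores(cls, txt)
```

Because the blend is a convex combination of two probability distributions, the refined scores remain a valid distribution; the softer text-derived term spreads mass across plausible classes, which is the soft-label-like regularization effect the abstract describes.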
