Jinfu Liu, Chen Chen, Mengyuan Liu
Skeleton-based action recognition has attracted significant attention because skeletons are concise and robust. However, skeletons lack detailed body information, which limits performance, while other multimodal methods require substantial inference resources and are inefficient because they use multimodal data during both training and inference. To address these limitations and fully exploit complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework that leverages multimodal large language models (LLMs) as auxiliary networks for efficient skeleton-based action recognition. MMCL engages in multi-modality co-learning during training and remains efficient by using only concise skeletons at inference. The framework consists primarily of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instructions to generate instructive features, drawing on the strong generalization of multimodal LLMs. These instructive text features further refine the classification scores, and the refined scores enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA benchmarks consistently verify the effectiveness of MMCL, which outperforms existing skeleton-based action recognition methods. Experiments on the UTD-MHAD and SYSU-Action datasets further demonstrate that MMCL generalizes well to zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
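
To make the two training-time ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the contrastive alignment in FAM can be illustrated with a symmetric InfoNCE-style loss over paired skeleton and RGB feature vectors, and that the score refinement in FRM can be illustrated by adding a per-class score vector derived from the text features to the skeleton logits. The function names, the `temperature` and `alpha` parameters, and the additive form of the refinement are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_alignment_loss(skeleton_feats, rgb_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed stand-in for FAM's objective).

    The i-th skeleton feature and the i-th RGB feature form the positive
    pair; every other pairing in the batch acts as a negative.
    """
    skel = [l2_normalize(v) for v in skeleton_feats]
    rgb = [l2_normalize(v) for v in rgb_feats]
    n = len(skel)
    # Cosine similarity matrix, scaled by the temperature.
    sim = [
        [sum(a * b for a, b in zip(skel[i], rgb[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Skeleton-to-RGB direction: each row should peak on the diagonal.
    loss_s2r = -sum(math.log(softmax(sim[i])[i]) for i in range(n)) / n
    # RGB-to-skeleton direction: each column should peak on the diagonal.
    loss_r2s = -sum(
        math.log(softmax([sim[i][j] for i in range(n)])[j]) for j in range(n)
    ) / n
    return 0.5 * (loss_s2r + loss_r2s)


def refine_scores(skeleton_logits, text_scores, alpha=0.5):
    """Hypothetical refinement: shift skeleton logits by text-derived scores.

    `text_scores` is assumed to be a per-class vector already projected from
    the multimodal LLM's instructive features. The returned distribution can
    be used like a soft label.
    """
    combined = [s + alpha * t for s, t in zip(skeleton_logits, text_scores)]
    return softmax(combined)


if __name__ == "__main__":
    skeleton = [[1.0, 0.0], [0.0, 1.0]]
    rgb_aligned = [[2.0, 0.0], [0.0, 3.0]]
    rgb_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_alignment_loss(skeleton, rgb_aligned))
    print(contrastive_alignment_loss(skeleton, rgb_swapped))
    print(refine_scores([2.0, 1.0, 0.0], [0.0, 3.0, 0.0], alpha=1.0))
```

In this sketch the loss is small when each skeleton feature points in the same direction as its paired RGB feature and large when the pairs are swapped. Both functions would be used only during training, which is consistent with the abstract's statement that inference relies on skeletons alone.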
The results below are listed identically under each of the following task tags: Video, Temporal Action Localization, Zero-Shot Learning, Activity Recognition, Action Localization, Action Detection, 3D Action Recognition, Action Recognition.

| Dataset | Metric | Value | Model |
|---|---|---|---|
| NTU RGB+D 120 | Accuracy (Cross-Setup, %) | 91.7 | MMCL |
| NTU RGB+D 120 | Accuracy (Cross-Subject, %) | 90.3 | MMCL |
| NTU RGB+D 120 | Ensembled Modalities | 6 | MMCL |
| N-UCLA | Accuracy (%) | 97.5 | MMCL |
| NTU RGB+D | Accuracy (Cross-Subject, %) | 93.5 | MMCL |
| NTU RGB+D | Accuracy (Cross-View, %) | 97.4 | MMCL |
| NTU RGB+D | Ensembled Modalities | 6 | MMCL |