Yujie Zhou, Wenwen Qiang, Anyi Rao, Ning Lin, Bing Su, Jiaqi Wang
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between the visual and semantic spaces from seen to unseen classes. Previous studies have primarily focused on encoding sequences into a single feature vector and then mapping the features to an identical anchor point in the embedding space. Their performance is hindered by 1) ignoring the global visual/semantic distribution alignment, which limits their ability to capture the true interdependence between the two spaces, and 2) neglecting temporal information, since frame-wise features with rich action cues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between the visual and semantic spaces for distribution alignment, and 2) we leverage temporal information when estimating the MI by encouraging the MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method. Code: https://github.com/YujieOuO/SMIE.
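The abstract describes maximizing an MI objective between visual (skeleton) features and semantic (label-text) features. The paper's exact estimator is in the linked repository; as a minimal illustration of the general idea, the sketch below computes an InfoNCE-style lower bound on MI between paired feature matrices, a standard estimator for this kind of objective. The function name, the choice of InfoNCE, and the temperature value are assumptions for illustration, not the paper's method.

```python
import numpy as np

def info_nce_mi_lower_bound(visual, semantic, temperature=0.1):
    """InfoNCE-style lower bound on MI between paired features.

    Rows of `visual` and `semantic` are matched (visual[i] comes from
    the same action as semantic[i]). Hypothetical sketch only; the
    paper's actual MI estimator may differ.
    """
    # L2-normalize both feature sets so similarities are cosine-based.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    s = semantic / np.linalg.norm(semantic, axis=1, keepdims=True)
    logits = v @ s.T / temperature                   # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matched pairs sit on the diagonal; the bound is log N minus the
    # contrastive loss, and maximizing it aligns the two spaces.
    n = visual.shape[0]
    nce_loss = -np.mean(np.diag(log_softmax))
    return np.log(n) - nce_loss
```

Training would maximize this quantity (i.e., minimize the contrastive loss) over batches of visual/semantic pairs; the temporal variant described in the abstract would additionally require the bound to grow as more frames of a sequence are fed to the visual encoder.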
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Zero-Shot Skeleton-Based Action Recognition | NTU RGB+D | Accuracy (5 unseen classes) | 77.98 | SMIE |
| Zero-Shot Skeleton-Based Action Recognition | NTU RGB+D | Accuracy (12 unseen classes) | 40.18 | SMIE |
| Zero-Shot Skeleton-Based Action Recognition | NTU RGB+D | Random Split Accuracy | 65.08 | SMIE |
| Zero-Shot Skeleton-Based Action Recognition | NTU RGB+D 120 | Accuracy (10 unseen classes) | 65.74 | SMIE |
| Zero-Shot Skeleton-Based Action Recognition | NTU RGB+D 120 | Accuracy (24 unseen classes) | 45.3 | SMIE |
| Zero-Shot Skeleton-Based Action Recognition | NTU RGB+D 120 | Random Split Accuracy | 46.4 | SMIE |
| Zero-Shot Skeleton-Based Action Recognition | PKU-MMD | Random Split Accuracy | 60.83 | SMIE |