Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat
In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying with time. As a result, different ADL may look very similar, and distinguishing them often requires looking at their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset, NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern-UCLA.
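
To make the idea of joint spatio-temporal attention more concrete, here is a minimal NumPy sketch of how pose-derived weights could modulate a video feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear projections `w_spatial` and `w_temporal`, the softmax normalization, and the outer-product coupling are all hypothetical choices, and the pose backbone and spatial embedding are not modeled.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_attention(pose_feat, w_spatial, w_temporal):
    # Assumption: spatial and temporal attention are each a linear
    # projection of the pose feature followed by a softmax.
    a_s = softmax(w_spatial @ pose_feat)    # shape (S,)
    a_t = softmax(w_temporal @ pose_feat)   # shape (T,)
    # Assumption: the "coupler" is modeled as an outer product, giving
    # one weight per (time step, spatial position) pair.
    return np.outer(a_t, a_s)               # shape (T, S)

def attend(video_feat, attn):
    # video_feat: (T, S, C) features from an RGB backbone (random here).
    # Weight every position by its attention and pool to a (C,) vector.
    return (video_feat * attn[:, :, None]).sum(axis=(0, 1))

rng = np.random.default_rng(0)
T, S, C, D = 4, 9, 8, 16                    # hypothetical sizes
video_feat = rng.standard_normal((T, S, C))
pose_feat = rng.standard_normal(D)
w_spatial = rng.standard_normal((S, D))
w_temporal = rng.standard_normal((T, D))

attn = joint_attention(pose_feat, w_spatial, w_temporal)
pooled = attend(video_feat, attn)
```

Because each factor is softmax-normalized, the joint weights in this sketch sum to one over the whole clip, so `pooled` is a weighted average of the video features in which pose decides where and when to look.
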
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| Video | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| Video | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |
| Video | Toyota Smarthome dataset | CS | 60.8 | VPN (RGB + Pose) |
| Video | Toyota Smarthome dataset | CV1 | 43.8 | VPN (RGB + Pose) |
| Video | Toyota Smarthome dataset | CV2 | 53.5 | VPN (RGB + Pose) |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| Temporal Action Localization | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| Zero-Shot Learning | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| Zero-Shot Learning | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |
| Activity Recognition | NTU RGB+D | Accuracy (CS) | 95.5 | VPN (RGB + Pose) |
| Activity Recognition | NTU RGB+D | Accuracy (CV) | 98 | VPN (RGB + Pose) |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN (RGB + Pose) |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN (RGB + Pose) |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| Activity Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| Activity Recognition | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| Action Localization | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| Action Detection | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| Action Detection | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| 3D Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| 3D Action Recognition | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |
| Action Recognition | NTU RGB+D | Accuracy (CS) | 95.5 | VPN (RGB + Pose) |
| Action Recognition | NTU RGB+D | Accuracy (CV) | 98 | VPN (RGB + Pose) |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN (RGB + Pose) |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN (RGB + Pose) |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 87.8 | VPN |
| Action Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 86.3 | VPN |
| Action Recognition | N-UCLA | Accuracy | 93.5 | VPN (RGB + Pose) |