Adrian Holzbock, Alexander Tsaregorodtsev, Youssef Dawoud, Klaus Dietmayer, Vasileios Belagiannis
Gesture recognition is essential for the interaction of autonomous vehicles with humans. While current approaches focus on combining several modalities such as image features, keypoints, and bone vectors, we present a neural network architecture that delivers state-of-the-art results with body skeleton input data alone. We propose the spatio-temporal multilayer perceptron (stMLP) for gesture recognition in the context of autonomous vehicles. Given 3D body poses over time, we define temporal and spatial mixing operations to extract features in both domains. Additionally, the importance of each time step is re-weighted with Squeeze-and-Excitation layers. An extensive evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach. Furthermore, we deploy our model in our autonomous vehicle to demonstrate its real-time capability and stable execution.
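The core operations described above — temporal mixing, spatial mixing, and Squeeze-and-Excitation re-weighting of time steps — can be sketched in NumPy. This is a minimal illustration under assumed shapes (16 time steps, 17 joints × 3 coordinates) and ReLU activations; the actual stMLP layer sizes, activations, and normalization are not specified in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Two-layer perceptron with ReLU; layer sizes here are illustrative."""
    return np.maximum(x @ w1, 0.0) @ w2

T, F = 16, 51                  # assumed: 16 time steps, 17 joints x 3 coords
x = rng.normal(size=(T, F))    # 3D body poses over time, flattened per step

# Temporal mixing: the MLP operates along the time axis
# (transpose, mix time steps, transpose back), with a residual connection.
wt1, wt2 = 0.1 * rng.normal(size=(T, T)), 0.1 * rng.normal(size=(T, T))
x = x + mlp(x.T, wt1, wt2).T

# Spatial mixing: the MLP operates along the joint-coordinate axis.
ws1, ws2 = 0.1 * rng.normal(size=(F, F)), 0.1 * rng.normal(size=(F, F))
x = x + mlp(x, ws1, ws2)

# Squeeze-and-Excitation over time steps: squeeze each step to a scalar,
# excite through a bottleneck (reduction 4 assumed), sigmoid-gate the steps.
s = x.mean(axis=1)                                        # squeeze: (T,)
we1, we2 = rng.normal(size=(T, T // 4)), rng.normal(size=(T // 4, T))
gate = 1.0 / (1.0 + np.exp(-(np.maximum(s @ we1, 0.0) @ we2)))
x = x * gate[:, None]                                     # re-weight time steps

print(x.shape)
```

A final classification head (not shown) would map the mixed features to gesture classes; the mixing weights would of course be learned rather than random.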
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Gesture Recognition | TCG | Accuracy | 85.99 | stMLP |
| Gesture Recognition | TCG | F1-Score | 80.05 | stMLP |
| Gesture Recognition | TCG | Jaccard Index | 67.88 | stMLP |
| Gesture Recognition | Drive&Act | Mean per-class accuracy | 34.61 | stMLP |