Alexandros Stergiou, Ronald Poppe
Generalizing over temporal variations is a prerequisite for effective action recognition in videos. Despite significant advances in deep neural networks, it remains a challenge to focus on short-term discriminative motions in relation to the overall performance of an action. We address this challenge by allowing some flexibility in discovering relevant spatio-temporal features. We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs with similar activations with potential temporal variations. We implement this idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics, in conjunction with a temporal gate that is responsible for evaluating the consistency of the discovered dynamics and the modeled features. We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs. On Kinetics-700, we perform on par with current state-of-the-art models, and outperform these on HACS, Moments in Time, UCF-101 and HMDB-51.
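
As a rough illustration of the gating idea described above, the following pure-Python sketch treats "consistency" as a round-trip nearest-neighbour check between per-frame feature vectors and the recurrent outputs. This check, the sigmoid scaling, and all names (`nearest`, `is_cycle_consistent`, `temporal_gate`, `features`, `dynamics`) are assumptions made for illustration only; the abstract does not specify how the gate is computed. The LSTM itself is omitted, and its outputs are passed in as plain lists.

```python
from math import dist, exp


def nearest(query, candidates):
    """Index of the candidate vector closest to `query` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: dist(query, candidates[i]))


def is_cycle_consistent(features, dynamics):
    """True if every time step maps into `dynamics` and back to itself.

    Assumed consistency test: features[t] -> nearest dynamics[j] -> nearest
    features[back] must return to the starting index t.
    """
    for t, f_t in enumerate(features):
        j = nearest(f_t, dynamics)             # features -> dynamics
        back = nearest(dynamics[j], features)  # dynamics -> features
        if back != t:
            return False
    return True


def temporal_gate(features, dynamics):
    """Rescale features by sigmoid(dynamics) only when the gate is open."""
    if not is_cycle_consistent(features, dynamics):
        return features  # gate closed: features pass through unchanged
    return [
        [f * (1.0 / (1.0 + exp(-d))) for f, d in zip(f_t, d_t)]
        for f_t, d_t in zip(features, dynamics)
    ]


# Hypothetical toy data: 3 time steps, 2 channels each.
features = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
aligned = [[0.1, 0.0], [1.1, 1.0], [2.1, 2.0]]    # tracks the features
collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # no temporal structure

print(is_cycle_consistent(features, aligned))    # True
print(is_cycle_consistent(features, collapsed))  # False
```

In this sketch, a closed gate returns the features unchanged, so a block built this way would fall back to the behaviour of the underlying network whenever the recurrent outputs do not line up with the features.
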
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-700 | Top-1 Accuracy | 56.46 | SRTG r3d-101 |
| Video | Kinetics-700 | Top-5 Accuracy | 76.82 | SRTG r3d-101 |
| Video | Kinetics-700 | Top-1 Accuracy | 54.17 | SRTG r(2+1)d-50 |
| Video | Kinetics-700 | Top-5 Accuracy | 74.62 | SRTG r(2+1)d-50 |
| Video | Kinetics-700 | Top-1 Accuracy | 53.52 | SRTG r3d-50 |
| Video | Kinetics-700 | Top-5 Accuracy | 74.17 | SRTG r3d-50 |
| Video | Kinetics-700 | Top-1 Accuracy | 49.43 | SRTG r(2+1)d-34 |
| Video | Kinetics-700 | Top-5 Accuracy | 73.23 | SRTG r(2+1)d-34 |
| Video | Kinetics-700 | Top-1 Accuracy | 49.15 | SRTG r3d-34 |
| Video | Kinetics-700 | Top-5 Accuracy | 72.68 | SRTG r3d-34 |
| Video | MiT | Top-1 Accuracy | 33.56 | SRTG r3d-101 |
| Video | MiT | Top-5 Accuracy | 58.49 | SRTG r3d-101 |
| Video | MiT | Top-1 Accuracy | 31.6 | SRTG r(2+1)d-50 |
| Video | MiT | Top-5 Accuracy | 56.8 | SRTG r(2+1)d-50 |
| Video | MiT | Top-1 Accuracy | 30.72 | SRTG r3d-50 |
| Video | MiT | Top-5 Accuracy | 55.65 | SRTG r3d-50 |
| Video | MiT | Top-1 Accuracy | 28.97 | SRTG r(2+1)d-34 |
| Video | MiT | Top-5 Accuracy | 54.18 | SRTG r(2+1)d-34 |
| Video | MiT | Top-1 Accuracy | 28.55 | SRTG r3d-34 |
| Video | MiT | Top-5 Accuracy | 52.35 | SRTG r3d-34 |
| Activity Recognition | HACS | Top-1 Accuracy | 84.33 | SRTG r(2+1)d-101 |
| Activity Recognition | HACS | Top-5 Accuracy | 96.85 | SRTG r(2+1)d-101 |
| Activity Recognition | HACS | Top-1 Accuracy | 83.77 | SRTG r(2+1)d-50 |
| Activity Recognition | HACS | Top-5 Accuracy | 96.56 | SRTG r(2+1)d-50 |
| Activity Recognition | HACS | Top-1 Accuracy | 81.66 | SRTG r3d-101 |
| Activity Recognition | HACS | Top-5 Accuracy | 96.33 | SRTG r3d-101 |
| Activity Recognition | HACS | Top-1 Accuracy | 80.39 | SRTG r(2+1)d-34 |
| Activity Recognition | HACS | Top-5 Accuracy | 94.27 | SRTG r(2+1)d-34 |
| Activity Recognition | HACS | Top-1 Accuracy | 80.36 | SRTG r3d-50 |
| Activity Recognition | HACS | Top-5 Accuracy | 95.55 | SRTG r3d-50 |
| Activity Recognition | HACS | Top-1 Accuracy | 78.6 | SRTG r3d-34 |
| Activity Recognition | HACS | Top-5 Accuracy | 93.57 | SRTG r3d-34 |
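
The table reports top-1 and top-5 accuracy as percentages. For readers unfamiliar with the metric, the sketch below shows one common way to compute it from per-class scores. The function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the paper's evaluation code.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-sample lists, one score per class.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Hypothetical toy example: 3 samples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # predicted class 1, true class 1
    [0.5, 0.3, 0.2],  # predicted class 0, true class 1 (second best)
    [0.2, 0.3, 0.5],  # predicted class 2, true class 0 (ranked last)
]
labels = [1, 1, 0]

print(round(top_k_accuracy(scores, labels, 1), 2))  # 33.33
print(round(top_k_accuracy(scores, labels, 2), 2))  # 66.67
```

Top-5 accuracy is always at least as high as top-1 on the same predictions, which matches the pattern in every row pair of the table.
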