Chen Ju, Peisen Zhao, Ya Zhang, Yanfeng Wang, Qi Tian
Point-level temporal action localization (PTAL) aims to localize actions in untrimmed videos given only a single timestamp annotation per action instance. Existing methods adopt a frame-level prediction paradigm to learn from these sparse single-frame labels, but such a framework inevitably suffers from a large solution space. This paper explores the proposal-based prediction paradigm for point-level annotations, which offers a more constrained solution space and consistent predictions among neighboring frames. The point-level annotations are first used as keypoint supervision to train a keypoint detector. At the location prediction stage, a simple but effective mapper module, which enables back-propagation of training errors, is then introduced to bridge the fully-supervised framework with weak supervision. To the best of our knowledge, this is the first work to leverage the fully-supervised paradigm for the point-level setting. Experiments on THUMOS14, BEOID, and GTEA verify the effectiveness of the proposed method both quantitatively and qualitatively, and show that it outperforms state-of-the-art methods.
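The abstract's "mapper module" is described only as a component that lets training errors back-propagate from proposal-level losses to the keypoint detector. One common way to make a discrete location prediction differentiable is a soft-argmax over per-frame keypoint scores; the sketch below illustrates that idea in pure Python. This is an illustrative assumption, not the paper's actual module, and the function name `soft_argmax_mapper` is hypothetical.

```python
import math

def soft_argmax_mapper(scores, temperature=1.0):
    """Illustrative differentiable 'mapper': maps per-frame keypoint
    scores to a continuous temporal location via a softmax-weighted
    mean of frame indices. Unlike a hard argmax, this expectation is
    smooth in the scores, so a loss at the predicted location can
    back-propagate to the detector. (Sketch only; the paper's mapper
    module may be defined differently.)"""
    # Numerically stabilized softmax over the temporal axis.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Expected frame index: a smooth surrogate for argmax.
    return sum(t * p for t, p in enumerate(probs))

# Scores sharply peaked at frame 3 map to a location near 3.0.
scores = [0.1, 0.2, 0.5, 4.0, 0.4, 0.1]
loc = soft_argmax_mapper(scores, temperature=0.5)
```

Lowering the temperature sharpens the softmax, pushing the predicted location toward the hard argmax while keeping the mapping differentiable.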
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Temporal Action Localization (point-level) | THUMOS14 | avg-mAP@0.1:0.5 | 55.6 | Ju et al. |
| Temporal Action Localization (point-level) | THUMOS14 | avg-mAP@0.1:0.7 | 44.8 | Ju et al. |
| Temporal Action Localization (point-level) | THUMOS14 | avg-mAP@0.3:0.7 | 35.4 | Ju et al. |
| Temporal Action Localization (point-level) | THUMOS14 | mAP@0.5 | 35.9 | Ju et al. |
| Temporal Action Localization (point-level) | BEOID | avg-mAP@0.1:0.7 | 34.9 | Ju et al. |
| Temporal Action Localization (point-level) | BEOID | mAP@0.5 | 20.9 | Ju et al. |
| Temporal Action Localization (point-level) | GTEA | avg-mAP@0.1:0.7 | 33.7 | Ju et al. |
| Temporal Action Localization (point-level) | GTEA | mAP@0.5 | 21.9 | Ju et al. |