Pilhyeon Lee, Hyeran Byun
We tackle the problem of localizing temporal intervals of actions with only a single frame label for each action instance for training. Owing to label sparsity, existing work fails to learn action completeness, resulting in fragmentary action predictions. In this paper, we propose a novel framework, where dense pseudo-labels are generated to provide completeness guidance for the model. Concretely, we first select pseudo background points to supplement point-level action labels. Then, by taking the points as seeds, we search for the optimal sequence that is likely to contain complete action instances while agreeing with the seeds. To learn completeness from the obtained sequence, we introduce two novel losses that contrast action instances with background ones in terms of action score and feature similarity, respectively. Experimental results demonstrate that our completeness guidance indeed helps the model to locate complete action instances, leading to large performance gains, especially under high IoU thresholds. Moreover, we demonstrate the superiority of our method over existing state-of-the-art methods on four benchmarks: THUMOS'14, GTEA, BEOID, and ActivityNet. Notably, our method even performs comparably to recent fully-supervised methods at a 6 times lower annotation cost. Our code is available at https://github.com/Pilhyeon.
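To make the link between fragmentary predictions and high IoU thresholds concrete, the following is a minimal sketch of the standard temporal IoU between a predicted and a ground-truth interval. The function name `temporal_iou` and all interval values are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two intervals given as (start, end) in seconds."""
    # Length of the overlap, clipped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth action spanning 10 s to 20 s (illustrative only).
gt = (10.0, 20.0)
fragment = (12.0, 16.0)  # covers only part of the action
complete = (10.5, 19.5)  # covers almost the whole action

for name, pred in [("fragment", fragment), ("complete", complete)]:
    iou = temporal_iou(pred, gt)
    # A prediction counts as correct at threshold t only if iou >= t.
    matched = [t for t in (0.3, 0.5, 0.7) if iou >= t]
    print(f"{name}: tIoU={iou:.2f}, matched thresholds={matched}")
```

In this toy case the fragment reaches a tIoU of 0.4 and is counted only at the 0.3 threshold, while the near-complete prediction reaches 0.9 and is counted at all three, which is why completeness matters most for metrics at high IoU thresholds.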
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | GTEA | mAP@0.1:0.7 | 43.5 | LACP |
| Video | GTEA | mAP@0.5 | 33.9 | LACP |
| Video | BEOID | mAP@0.1:0.7 | 51.8 | LACP |
| Video | BEOID | mAP@0.5 | 42.7 | LACP |
| Video | THUMOS 2014 | mAP@0.1:0.5 | 62.7 | LACP |
| Video | THUMOS 2014 | mAP@0.1:0.7 | 52.8 | LACP |
| Video | THUMOS 2014 | mAP@0.5 | 45.3 | LACP |
| Video | THUMOS14 | avg-mAP (0.1-0.5) | 62.7 | LACP |
| Video | THUMOS14 | avg-mAP (0.1:0.7) | 52.8 | LACP |
| Video | THUMOS14 | avg-mAP (0.3-0.7) | 44.5 | LACP |
| Video | THUMOS’14 | mAP@0.5 | 45.3 | LACP |
| Video | ActivityNet-1.3 | mAP@0.5 | 40.4 | LACP |
| Video | ActivityNet-1.3 | mAP@0.5:0.95 | 25.1 | LACP |
| Video | ActivityNet-1.2 | Mean mAP | 26.8 | LACP |
| Video | ActivityNet-1.2 | mAP@0.5 | 44.0 | LACP |
| Temporal Action Localization | GTEA | mAP@0.1:0.7 | 43.5 | LACP |
| Temporal Action Localization | GTEA | mAP@0.5 | 33.9 | LACP |
| Temporal Action Localization | BEOID | mAP@0.1:0.7 | 51.8 | LACP |
| Temporal Action Localization | BEOID | mAP@0.5 | 42.7 | LACP |
| Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 62.7 | LACP |
| Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 52.8 | LACP |
| Temporal Action Localization | THUMOS 2014 | mAP@0.5 | 45.3 | LACP |
| Temporal Action Localization | THUMOS14 | avg-mAP (0.1-0.5) | 62.7 | LACP |
| Temporal Action Localization | THUMOS14 | avg-mAP (0.1:0.7) | 52.8 | LACP |
| Temporal Action Localization | THUMOS14 | avg-mAP (0.3-0.7) | 44.5 | LACP |
| Temporal Action Localization | THUMOS’14 | mAP@0.5 | 45.3 | LACP |
| Temporal Action Localization | ActivityNet-1.3 | mAP@0.5 | 40.4 | LACP |
| Temporal Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 25.1 | LACP |
| Temporal Action Localization | ActivityNet-1.2 | Mean mAP | 26.8 | LACP |
| Temporal Action Localization | ActivityNet-1.2 | mAP@0.5 | 44.0 | LACP |
| Zero-Shot Learning | GTEA | mAP@0.1:0.7 | 43.5 | LACP |
| Zero-Shot Learning | GTEA | mAP@0.5 | 33.9 | LACP |
| Zero-Shot Learning | BEOID | mAP@0.1:0.7 | 51.8 | LACP |
| Zero-Shot Learning | BEOID | mAP@0.5 | 42.7 | LACP |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.5 | 62.7 | LACP |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.7 | 52.8 | LACP |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.5 | 45.3 | LACP |
| Zero-Shot Learning | THUMOS14 | avg-mAP (0.1-0.5) | 62.7 | LACP |
| Zero-Shot Learning | THUMOS14 | avg-mAP (0.1:0.7) | 52.8 | LACP |
| Zero-Shot Learning | THUMOS14 | avg-mAP (0.3-0.7) | 44.5 | LACP |
| Zero-Shot Learning | THUMOS’14 | mAP@0.5 | 45.3 | LACP |
| Zero-Shot Learning | ActivityNet-1.3 | mAP@0.5 | 40.4 | LACP |
| Zero-Shot Learning | ActivityNet-1.3 | mAP@0.5:0.95 | 25.1 | LACP |
| Zero-Shot Learning | ActivityNet-1.2 | Mean mAP | 26.8 | LACP |
| Zero-Shot Learning | ActivityNet-1.2 | mAP@0.5 | 44.0 | LACP |
| Action Localization | GTEA | mAP@0.1:0.7 | 43.5 | LACP |
| Action Localization | GTEA | mAP@0.5 | 33.9 | LACP |
| Action Localization | BEOID | mAP@0.1:0.7 | 51.8 | LACP |
| Action Localization | BEOID | mAP@0.5 | 42.7 | LACP |
| Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 62.7 | LACP |
| Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 52.8 | LACP |
| Action Localization | THUMOS 2014 | mAP@0.5 | 45.3 | LACP |
| Action Localization | THUMOS14 | avg-mAP (0.1-0.5) | 62.7 | LACP |
| Action Localization | THUMOS14 | avg-mAP (0.1:0.7) | 52.8 | LACP |
| Action Localization | THUMOS14 | avg-mAP (0.3-0.7) | 44.5 | LACP |
| Action Localization | THUMOS’14 | mAP@0.5 | 45.3 | LACP |
| Action Localization | ActivityNet-1.3 | mAP@0.5 | 40.4 | LACP |
| Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 25.1 | LACP |
| Action Localization | ActivityNet-1.2 | Mean mAP | 26.8 | LACP |
| Action Localization | ActivityNet-1.2 | mAP@0.5 | 44.0 | LACP |
| Weakly Supervised Action Localization | GTEA | mAP@0.1:0.7 | 43.5 | LACP |
| Weakly Supervised Action Localization | GTEA | mAP@0.5 | 33.9 | LACP |
| Weakly Supervised Action Localization | BEOID | mAP@0.1:0.7 | 51.8 | LACP |
| Weakly Supervised Action Localization | BEOID | mAP@0.5 | 42.7 | LACP |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 62.7 | LACP |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 52.8 | LACP |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.5 | 45.3 | LACP |
| Weakly Supervised Action Localization | THUMOS14 | avg-mAP (0.1-0.5) | 62.7 | LACP |
| Weakly Supervised Action Localization | THUMOS14 | avg-mAP (0.1:0.7) | 52.8 | LACP |
| Weakly Supervised Action Localization | THUMOS14 | avg-mAP (0.3-0.7) | 44.5 | LACP |
| Weakly Supervised Action Localization | THUMOS’14 | mAP@0.5 | 45.3 | LACP |
| Weakly Supervised Action Localization | ActivityNet-1.3 | mAP@0.5 | 40.4 | LACP |
| Weakly Supervised Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 25.1 | LACP |
| Weakly Supervised Action Localization | ActivityNet-1.2 | Mean mAP | 26.8 | LACP |
| Weakly Supervised Action Localization | ActivityNet-1.2 | mAP@0.5 | 44.0 | LACP |
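
Metric names such as mAP@0.1:0.7 and mAP@0.5:0.95 in the table above conventionally denote the mean of the per-threshold mAP values over an inclusive grid of IoU thresholds (step 0.1 for the THUMOS-style ranges, step 0.05 for ActivityNet). The sketch below shows that averaging under this assumption. The helper names `iou_thresholds` and `average_map` are hypothetical, and the per-threshold values are dummy placeholders, not results from the paper.

```python
def iou_thresholds(start, stop, step):
    """Inclusive grid of IoU thresholds, e.g. 0.1, 0.2, ..., 0.7."""
    n = int(round((stop - start) / step)) + 1
    # Round to two decimals to avoid floating-point drift such as 0.30000000000000004.
    return [round(start + i * step, 2) for i in range(n)]


def average_map(map_per_threshold, thresholds):
    """Mean of per-threshold mAP values over the given threshold grid."""
    return sum(map_per_threshold[t] for t in thresholds) / len(thresholds)


thumos_grid = iou_thresholds(0.1, 0.7, 0.1)         # 7 thresholds
activitynet_grid = iou_thresholds(0.5, 0.95, 0.05)  # 10 thresholds

# Dummy per-threshold mAP values for illustration only (NOT from the paper).
dummy_map = {t: 100.0 * (1.0 - t) for t in thumos_grid}

print(thumos_grid)
print(len(activitynet_grid))
print(average_map(dummy_map, thumos_grid))
```

Because the average weights every threshold equally, gains at the strict end of the grid (0.6 and 0.7) raise the averaged score just as much as gains at the lenient end.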