Pilhyeon Lee, Youngjung Uh, Hyeran Byun
Weakly-supervised temporal action localization is challenging because frame-wise labels are not available during training; the only supervision is video-level labels indicating whether each video contains action frames of interest. Previous methods aggregate frame-level class scores to produce a video-level prediction and learn from video-level action labels. This formulation does not fully model the problem: background frames are forced to be misclassified as action classes so that video-level labels are predicted accurately. In this paper, we design the Background Suppression Network (BaS-Net), which introduces an auxiliary class for background and has a two-branch weight-sharing architecture with an asymmetrical training strategy. This enables BaS-Net to suppress activations from background frames and thereby improve localization performance. Extensive experiments demonstrate the effectiveness of BaS-Net and its superiority over state-of-the-art methods on the most popular benchmarks, THUMOS'14 and ActivityNet. Our code and the trained model are available at https://github.com/Pilhyeon/BaSNet-pytorch.
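The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes top-k mean pooling as the aggregation step, per-frame foreground weights applied directly to class scores as the suppression step, and a last column that holds the auxiliary background class. The function names `video_level_scores` and `suppress`, the value of `k`, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_scores(frame_scores, k):
    """Aggregate frame-level class scores into one video-level prediction.

    Assumption: each class is pooled by averaging its k highest frame
    scores, followed by a softmax over classes.
    """
    num_classes = len(frame_scores[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((frame[c] for frame in frame_scores), reverse=True)
        pooled.append(sum(column[:k]) / k)
    return softmax(pooled)


def suppress(frame_scores, fg_weights):
    """Scale every frame's scores by a foreground weight in [0, 1].

    Assumption: the weights come from some learned filtering step; here
    they are simply given.
    """
    return [[w * s for s in frame] for frame, w in zip(frame_scores, fg_weights)]


# Toy video: 4 frames, 2 action classes plus a background class (last column).
# Frames 0-1 show action class 0, frames 2-3 are background.
frame_scores = [
    [4.0, 0.0, 0.0],
    [3.0, 0.0, 1.0],
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 4.0],
]
fg_weights = [1.0, 0.9, 0.1, 0.0]  # hypothetical foreground weights

# Branch 1: no suppression. Its training target marks background as present.
base_pred = video_level_scores(frame_scores, k=2)
base_target = [1, 0, 1]

# Branch 2: same scores (shared weights), background frames suppressed.
# Its training target marks background as absent: the asymmetry.
supp_pred = video_level_scores(suppress(frame_scores, fg_weights), k=2)
supp_target = [1, 0, 0]
```

With these toy numbers the unsuppressed branch assigns its highest video-level score to the background class, while the suppressed branch assigns it to action class 0. Training the two branches against different background targets is what pushes the foreground weights toward zero on background frames in this sketch.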
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | THUMOS 2014 | mAP@0.1:0.5 | 43.6 | BaS-Net |
| Video | THUMOS 2014 | mAP@0.1:0.7 | 35.3 | BaS-Net |
| Video | THUMOS 2014 | mAP@0.5 | 27 | BaS-Net |
| Video | ActivityNet-1.3 | mAP@0.5 | 34.5 | BaS-Net |
| Video | ActivityNet-1.3 | mAP@0.5:0.95 | 22.2 | BaS-Net |
| Video | ActivityNet-1.2 | mAP@0.5 | 38.5 | BaS-Net |
| Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 43.6 | BaS-Net |
| Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 35.3 | BaS-Net |
| Temporal Action Localization | THUMOS 2014 | mAP@0.5 | 27 | BaS-Net |
| Temporal Action Localization | ActivityNet-1.3 | mAP@0.5 | 34.5 | BaS-Net |
| Temporal Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 22.2 | BaS-Net |
| Temporal Action Localization | ActivityNet-1.2 | mAP@0.5 | 38.5 | BaS-Net |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.5 | 43.6 | BaS-Net |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.7 | 35.3 | BaS-Net |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.5 | 27 | BaS-Net |
| Zero-Shot Learning | ActivityNet-1.3 | mAP@0.5 | 34.5 | BaS-Net |
| Zero-Shot Learning | ActivityNet-1.3 | mAP@0.5:0.95 | 22.2 | BaS-Net |
| Zero-Shot Learning | ActivityNet-1.2 | mAP@0.5 | 38.5 | BaS-Net |
| Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 43.6 | BaS-Net |
| Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 35.3 | BaS-Net |
| Action Localization | THUMOS 2014 | mAP@0.5 | 27 | BaS-Net |
| Action Localization | ActivityNet-1.3 | mAP@0.5 | 34.5 | BaS-Net |
| Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 22.2 | BaS-Net |
| Action Localization | ActivityNet-1.2 | mAP@0.5 | 38.5 | BaS-Net |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 43.6 | BaS-Net |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 35.3 | BaS-Net |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.5 | 27 | BaS-Net |
| Weakly Supervised Action Localization | ActivityNet-1.3 | mAP@0.5 | 34.5 | BaS-Net |
| Weakly Supervised Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 22.2 | BaS-Net |
| Weakly Supervised Action Localization | ActivityNet-1.2 | mAP@0.5 | 38.5 | BaS-Net |
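The metrics in the table are mean average precision at temporal IoU thresholds; a range such as `mAP@0.1:0.5` denotes an average over several thresholds. The sketch below shows how the overlap between a predicted and a ground-truth segment could be computed and how a threshold range could be expanded. The step sizes (0.1 for the THUMOS ranges, 0.05 for `0.5:0.95`) follow common benchmark practice and are an assumption here, as are the helper names `temporal_iou` and `iou_thresholds`.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def iou_thresholds(low, high, step):
    """Expand a range such as 0.1:0.5 into its list of IoU thresholds.

    Assumption: the range is inclusive at both ends and evenly spaced.
    """
    count = round((high - low) / step)
    return [round(low + i * step, 2) for i in range(count + 1)]


def is_match(pred, gt, threshold):
    """A prediction counts as correct at a threshold if its IoU reaches it."""
    return temporal_iou(pred, gt) >= threshold


# A prediction covering 0-10 s against a ground-truth action at 5-15 s
# overlaps for 5 s out of a 15 s union.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))
thumos_range = iou_thresholds(0.1, 0.5, 0.1)
activitynet_range = iou_thresholds(0.5, 0.95, 0.05)
```

In this example the prediction would count as a match at the 0.1, 0.2, and 0.3 thresholds but not at 0.4 or 0.5, which is why the averaged figures are higher for ranges that include loose thresholds than for `mAP@0.5` alone.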