Sanath Narayan, Hisham Cholakkal, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, Ling Shao
This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using video-level supervision. Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term for enhancing temporal action localization. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising loss term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, thereby leading to more robust predictions. Comprehensive experiments are performed on multiple benchmarks, including THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably in comparison to the existing methods on all datasets, achieving gains as high as 2.3% in terms of mAP at IoU=0.5 on THUMOS14. Source code is available at https://github.com/naraysa/D2-Net
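
The abstract does not say how video-level labels supervise the snippet-level temporal class activations. One common choice in weakly-supervised localization is top-k temporal pooling, sketched below under that assumption. It is not taken from the D2-Net source code, and the names `topk_video_scores`, `tcam`, and `k` are hypothetical.

```python
from typing import List


def topk_video_scores(tcam: List[List[float]], k: int) -> List[float]:
    """Pool a T x C temporal class activation map into C video-level scores.

    Assumption: each class score is the mean of its k largest activations
    over time (top-k pooling). This is an illustrative choice, not a
    confirmed detail of D2-Net.
    """
    if not tcam:
        raise ValueError("tcam must contain at least one snippet")
    num_classes = len(tcam[0])
    k = max(1, min(k, len(tcam)))  # clamp k to the valid range [1, T]
    scores = []
    for c in range(num_classes):
        # Sort this class's activations over time, largest first.
        column = sorted((row[c] for row in tcam), reverse=True)
        scores.append(sum(column[:k]) / k)
    return scores


# Toy example: 4 snippets, 2 classes.
tcam = [
    [0.1, 2.0],
    [0.2, 3.0],
    [0.0, -1.0],
    [0.3, 4.0],
]
scores = topk_video_scores(tcam, k=2)
print(scores)
```

The pooled scores can then be compared against the video-level labels with a standard classification loss, so that only the highest-scoring snippets of each class drive the supervision.
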
Results for D2-Net (listed identically under the task tags Video, Temporal Action Localization, Zero-Shot Learning, Action Localization, and Weakly Supervised Action Localization):

| Dataset | Metric | Value | Model |
|---|---|---|---|
| THUMOS 2014 | mAP@0.1:0.5 | 51.4 | D2-Net |
| THUMOS 2014 | mAP@0.5 | 35.9 | D2-Net |
| FineAction | mAP | 3.35 | D2-Net |
| FineAction | mAP IoU@0.5 | 6.75 | D2-Net |
| FineAction | mAP IoU@0.75 | 3.02 | D2-Net |
| FineAction | mAP IoU@0.95 | 0.82 | D2-Net |
| ActivityNet-1.2 | Mean mAP | 26 | D2-Net |
| ActivityNet-1.2 | mAP@0.5 | 42.3 | D2-Net |
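
The metrics above are reported at temporal IoU thresholds. As a minimal sketch of what such a threshold means, the code below computes the temporal IoU between a predicted and a ground-truth segment and checks it against a threshold. The function names and the `(start, end)` tuple representation in seconds are assumptions for illustration. The official evaluation toolkits of each benchmark additionally handle class labels, score ranking, and duplicate detections.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end), with start <= end


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D temporal segments."""
    start_p, end_p = pred
    start_g, end_g = gt
    # Overlap length is zero when the segments are disjoint.
    inter = max(0.0, min(end_p, end_g) - max(start_p, start_g))
    union = (end_p - start_p) + (end_g - start_g) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Segment, gt: Segment, threshold: float = 0.5) -> bool:
    """A prediction counts as correct at this threshold if IoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold


# A loose prediction: overlap 2 s, union 6 s, so IoU = 1/3.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)), is_match((2.0, 6.0), (4.0, 8.0)))
# A tighter prediction: overlap 4 s, union 5 s, so IoU = 0.8.
print(temporal_iou((3.0, 8.0), (4.0, 8.0)), is_match((3.0, 8.0), (4.0, 8.0)))
```

Raising the threshold from 0.5 to 0.95 requires nearly exact boundaries, which is consistent with the sharp drop in the FineAction numbers at higher IoU.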