Qinying Liu, Zilei Wang, Shenghai Rong, Junjie Li, Yixin Zhang
Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at https://github.com/Qinying-Liu/CASE
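
To make the idea of optimal-transport self-labeling more concrete, the following is a minimal sketch of how pseudo-labels matching a prior distribution could be produced with Sinkhorn-Knopp iterations. It is not taken from the released code: the function name `sinkhorn_pseudo_labels`, the temperature `epsilon`, the iteration count and the example prior are all assumptions made for illustration, and the actual method may use a different solver and different priors.

```python
import numpy as np

def sinkhorn_pseudo_labels(logits, prior, epsilon=0.05, n_iters=50):
    """Soft pseudo-labels via entropy-regularised optimal transport (sketch).

    logits: (N, K) scores of N snippets over K clusters.
    prior:  (K,) assumed target proportion of snippets per cluster.
    Returns Q of shape (N, K): each row sums to 1, and the column means
    approximately follow `prior` once the iterations have converged.
    """
    logits = np.asarray(logits, dtype=np.float64)
    prior = np.asarray(prior, dtype=np.float64)
    n, _ = logits.shape

    # Gibbs kernel. Subtracting the row-wise max keeps exp() stable and
    # does not change the result, because rows are rescaled below.
    Q = np.exp((logits - logits.max(axis=1, keepdims=True)) / epsilon)

    row_mass = np.full(n, 1.0 / n)   # every snippet carries equal mass
    col_mass = prior / prior.sum()   # desired cluster proportions

    for _ in range(n_iters):
        # Rescale columns so cluster sizes match the prior.
        Q *= (col_mass / Q.sum(axis=0))[None, :]
        # Rescale rows so each snippet keeps total mass 1/N.
        Q *= (row_mass / Q.sum(axis=1))[:, None]

    return Q * n  # multiply by N so each row is a probability vector


# Hypothetical usage with random scores for 200 snippets and 4 clusters.
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))
Q = sinkhorn_pseudo_labels(scores, prior=[0.4, 0.3, 0.2, 0.1], epsilon=0.5, n_iters=100)
hard_labels = Q.argmax(axis=1)  # one pseudo-label per snippet
```

The point of the transport constraint is that the pseudo-labels cannot collapse onto a single cluster: the column marginal forces the assignments to respect the assumed proportions while staying close to the model's own scores.
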
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | THUMOS 2014 | mAP@0.1:0.7 | 49.2 | CASE + Zhou et al. |
| Video | THUMOS 2014 | mAP@0.1:0.5 | 57.1 | CASE |
| Video | THUMOS 2014 | mAP@0.1:0.7 | 46.2 | CASE |
| Video | ActivityNet-1.3 | mAP@0.5 | 43.2 | CASE |
| Video | ActivityNet-1.3 | mAP@0.5:0.95 | 26.8 | CASE |
| Video | ActivityNet-1.2 | Mean mAP | 27.9 | CASE |
| Video | ActivityNet-1.2 | mAP@0.5 | 43.8 | CASE |
| Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 49.2 | CASE + Zhou et al. |
| Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 57.1 | CASE |
| Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 46.2 | CASE |
| Temporal Action Localization | ActivityNet-1.3 | mAP@0.5 | 43.2 | CASE |
| Temporal Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 26.8 | CASE |
| Temporal Action Localization | ActivityNet-1.2 | Mean mAP | 27.9 | CASE |
| Temporal Action Localization | ActivityNet-1.2 | mAP@0.5 | 43.8 | CASE |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.7 | 49.2 | CASE + Zhou et al. |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.5 | 57.1 | CASE |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.7 | 46.2 | CASE |
| Zero-Shot Learning | ActivityNet-1.3 | mAP@0.5 | 43.2 | CASE |
| Zero-Shot Learning | ActivityNet-1.3 | mAP@0.5:0.95 | 26.8 | CASE |
| Zero-Shot Learning | ActivityNet-1.2 | Mean mAP | 27.9 | CASE |
| Zero-Shot Learning | ActivityNet-1.2 | mAP@0.5 | 43.8 | CASE |
| Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 49.2 | CASE + Zhou et al. |
| Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 57.1 | CASE |
| Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 46.2 | CASE |
| Action Localization | ActivityNet-1.3 | mAP@0.5 | 43.2 | CASE |
| Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 26.8 | CASE |
| Action Localization | ActivityNet-1.2 | Mean mAP | 27.9 | CASE |
| Action Localization | ActivityNet-1.2 | mAP@0.5 | 43.8 | CASE |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 49.2 | CASE + Zhou et al. |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 57.1 | CASE |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 46.2 | CASE |
| Weakly Supervised Action Localization | ActivityNet-1.3 | mAP@0.5 | 43.2 | CASE |
| Weakly Supervised Action Localization | ActivityNet-1.3 | mAP@0.5:0.95 | 26.8 | CASE |
| Weakly Supervised Action Localization | ActivityNet-1.2 | Mean mAP | 27.9 | CASE |
| Weakly Supervised Action Localization | ActivityNet-1.2 | mAP@0.5 | 43.8 | CASE |
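
The metric names in the table follow the usual temporal action localization convention: a prediction counts as correct when its temporal IoU (tIoU) with a ground-truth segment reaches a threshold, and a range such as mAP@0.1:0.7 denotes the mAP averaged over a set of thresholds (step 0.1 on THUMOS14, step 0.05 for 0.5:0.95 on ActivityNet). The sketch below only illustrates these two ingredients; the helper names `temporal_iou` and `average_map` are hypothetical and are not part of any official evaluation toolkit.

```python
def temporal_iou(pred, gt):
    """tIoU of two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_map(map_per_threshold):
    """Mean of per-threshold mAP values, keyed by tIoU threshold."""
    values = list(map_per_threshold.values())
    return sum(values) / len(values)


# A predicted segment 0-10 s against a ground-truth segment 5-15 s
# overlaps for 5 s out of a 15 s union.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))
```
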