Sujoy Paul, Sourya Roy, Amit K. Roy-Chowdhury
Most activity localization methods in the literature are burdened by the need for frame-wise annotation. Learning from weak labels is a potential way to reduce this manual labeling effort. Recent years have witnessed a substantial influx of tagged videos on the Internet, which can serve as a rich source of weakly-supervised training data. Specifically, the correlations between videos with similar tags can be utilized to temporally localize the activities. Towards this goal, we present W-TALC, a Weakly-supervised Temporal Activity Localization and Classification framework using only video-level labels. The proposed network can be divided into two sub-networks, namely a Two-Stream based feature extractor network and a weakly-supervised module, which we learn by optimizing two complementary loss functions. Qualitative and quantitative results on two challenging datasets, Thumos14 and ActivityNet1.2, demonstrate that the proposed method is able to detect activities at a fine granularity and achieve better performance than current state-of-the-art methods.
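
To make "using only video-level labels" concrete, the sketch below shows one common multiple-instance-learning style setup: per-segment class scores are pooled over time into a single video-level score, which is then compared against the video's tags. This is an illustrative assumption, not the formulation defined in the abstract, which does not specify the two loss functions. The function names (`topk_mean_pool`, `video_level_loss`) and the choice of top-k mean pooling are hypothetical.

```python
import math


def topk_mean_pool(scores, k):
    """Average the k highest per-segment scores for each class.

    scores: list of T rows (one per temporal segment), each a list of C class scores.
    Returns a list of C video-level scores.
    """
    if not scores:
        raise ValueError("scores must contain at least one segment")
    k = max(1, min(k, len(scores)))  # clamp k to the valid range [1, T]
    num_classes = len(scores[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((row[c] for row in scores), reverse=True)
        pooled.append(sum(column[:k]) / k)
    return pooled


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_loss(scores, labels, k):
    """Cross-entropy between pooled class probabilities and normalized video tags.

    labels: multi-hot list of length C (1 if the tag is present in the video).
    Assumption: labels are normalized to sum to 1 so they form a distribution.
    """
    label_sum = sum(labels)
    if label_sum == 0:
        raise ValueError("at least one video-level label is required")
    probs = softmax(topk_mean_pool(scores, k))
    target = [l / label_sum for l in labels]
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Three segments, two classes; only the video-level tag (class 0) is known.
segment_scores = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0]]
pooled = topk_mean_pool(segment_scores, k=2)
loss = video_level_loss(segment_scores, labels=[1, 0], k=2)
```

No segment in this example carries its own label: the only supervision is the video-level tag, and the pooling step decides which segments contribute to the prediction. At test time the same per-segment scores could be thresholded to propose temporal boundaries.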
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | THUMOS 2014 | mAP@0.5 | 22.8 | W-TALC |
| Video | FineAction | mAP | 3.45 | W-TALC |
| Video | FineAction | mAP IOU@0.5 | 6.18 | W-TALC |
| Video | FineAction | mAP IOU@0.75 | 3.15 | W-TALC |
| Video | FineAction | mAP IOU@0.95 | 0.83 | W-TALC |
| Video | ActivityNet-1.2 | mAP@0.5 | 37 | W-TALC |
| Video | THUMOS’14 | mAP | 85.6 | W-TALC |
| Video | ActivityNet-1.2 | mAP | 93.2 | W-TALC |
| Temporal Action Localization | THUMOS 2014 | mAP@0.5 | 22.8 | W-TALC |
| Temporal Action Localization | FineAction | mAP | 3.45 | W-TALC |
| Temporal Action Localization | FineAction | mAP IOU@0.5 | 6.18 | W-TALC |
| Temporal Action Localization | FineAction | mAP IOU@0.75 | 3.15 | W-TALC |
| Temporal Action Localization | FineAction | mAP IOU@0.95 | 0.83 | W-TALC |
| Temporal Action Localization | ActivityNet-1.2 | mAP@0.5 | 37 | W-TALC |
| Zero-Shot Learning | THUMOS 2014 | mAP@0.5 | 22.8 | W-TALC |
| Zero-Shot Learning | FineAction | mAP | 3.45 | W-TALC |
| Zero-Shot Learning | FineAction | mAP IOU@0.5 | 6.18 | W-TALC |
| Zero-Shot Learning | FineAction | mAP IOU@0.75 | 3.15 | W-TALC |
| Zero-Shot Learning | FineAction | mAP IOU@0.95 | 0.83 | W-TALC |
| Zero-Shot Learning | ActivityNet-1.2 | mAP@0.5 | 37 | W-TALC |
| Action Localization | THUMOS 2014 | mAP@0.5 | 22.8 | W-TALC |
| Action Localization | FineAction | mAP | 3.45 | W-TALC |
| Action Localization | FineAction | mAP IOU@0.5 | 6.18 | W-TALC |
| Action Localization | FineAction | mAP IOU@0.75 | 3.15 | W-TALC |
| Action Localization | FineAction | mAP IOU@0.95 | 0.83 | W-TALC |
| Action Localization | ActivityNet-1.2 | mAP@0.5 | 37 | W-TALC |
| Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.5 | 22.8 | W-TALC |
| Weakly Supervised Action Localization | FineAction | mAP | 3.45 | W-TALC |
| Weakly Supervised Action Localization | FineAction | mAP IOU@0.5 | 6.18 | W-TALC |
| Weakly Supervised Action Localization | FineAction | mAP IOU@0.75 | 3.15 | W-TALC |
| Weakly Supervised Action Localization | FineAction | mAP IOU@0.95 | 0.83 | W-TALC |
| Weakly Supervised Action Localization | ActivityNet-1.2 | mAP@0.5 | 37 | W-TALC |
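
The localization metrics in the table (for example mAP@0.5 and mAP IOU@0.75) depend on a temporal intersection-over-union threshold: a predicted segment typically counts as correct only if its overlap with a ground-truth segment of the same class reaches that threshold. The snippet below is a minimal sketch of that overlap test. The function names and the `(start, end)` tuple representation in seconds are assumptions for illustration, not taken from any benchmark's official evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) tuples."""
    start_p, end_p = pred
    start_g, end_g = gt
    # Overlap length is zero when the segments do not intersect.
    intersection = max(0.0, min(end_p, end_g) - max(start_p, start_g))
    union = (end_p - start_p) + (end_g - start_g) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_threshold(pred, gt, threshold=0.5):
    """True if the prediction overlaps the ground truth by at least `threshold`."""
    return temporal_iou(pred, gt) >= threshold


# A prediction covering 2s-10s against a ground-truth activity at 0s-10s.
iou = temporal_iou((2.0, 10.0), (0.0, 10.0))
hit_at_05 = matches_at_threshold((2.0, 10.0), (0.0, 10.0), threshold=0.5)
hit_at_095 = matches_at_threshold((2.0, 10.0), (0.0, 10.0), threshold=0.95)
```

The same prediction can pass at a threshold of 0.5 and fail at 0.95, which is consistent with the reported scores falling as the IoU threshold rises.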