Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, DaCheng Tao
In this paper, we present TriDet, a one-stage framework for temporal action detection. Existing methods often produce imprecise boundary predictions because action boundaries in videos are ambiguous. To alleviate this problem, we propose a novel Trident-head that models each action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer that mitigates the rank-loss problem self-attention exhibits on video features and aggregates information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks, THUMOS14, HACS and EPIC-KITCHENS-100, at a lower computational cost than previous methods. For example, TriDet reaches an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$ (absolute) while requiring only $74.6\%$ of its latency. The code is available at https://github.com/sssste/TriDet.
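The abstract says the Trident-head models a boundary through an estimated relative probability distribution around it. The sketch below is our own minimal illustration of that idea, not TriDet's implementation: it assumes a head has already produced one logit per candidate offset `b = 0..B` from the current instant, normalizes the logits with a softmax, and takes the expected offset as the boundary distance. How TriDet's heads actually produce these logits is not modeled here; the names `expected_boundary_offset` and `predict_segment`, and the `stride` scaling, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_boundary_offset(logits):
    """Turn logits over candidate offsets 0..B into a distance estimate.

    ASSUMPTION: logits[b] scores "the boundary lies b steps away from the
    current instant"; the estimate is the expectation of b under the
    softmax distribution, so it can fall between integer offsets.
    """
    probs = softmax(logits)
    distance = sum(b * p for b, p in enumerate(probs))
    return distance, probs


def predict_segment(t, start_logits, end_logits, stride=1.0):
    """Hypothetical helper: decode an (start, end) segment around instant t.

    stride is an ASSUMED scale mapping feature-level offsets back to time
    units (e.g. the temporal stride of a pyramid level).
    """
    d_start, _ = expected_boundary_offset(start_logits)
    d_end, _ = expected_boundary_offset(end_logits)
    return t - stride * d_start, t + stride * d_end


# Example: a confident start (peak at offset 3) and an end whose
# probability mass is split evenly between offsets 2 and 3.
start, end = predict_segment(10.0, [0.0, 0.0, 0.0, 10.0, 0.0],
                             [0.0, 0.0, 5.0, 5.0, 0.0])
print(f"segment ~ [{start:.2f}, {end:.2f}]")
```

Because the estimate is an expectation rather than an argmax, it can land between discrete offsets (the end distance above comes out near 2.5), which is one way a distribution-based head can express uncertainty about an ambiguous boundary.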
| Dataset | Metric | mAP (%) | Model |
|---|---|---|---|
| HACS | Average mAP (IoU 0.5:0.95) | 38.6 | TriDet (SlowFast) |
| HACS | mAP@IoU 0.5 | 56.7 | TriDet (SlowFast) |
| HACS | mAP@IoU 0.75 | 39.3 | TriDet (SlowFast) |
| HACS | mAP@IoU 0.95 | 11.7 | TriDet (SlowFast) |
| HACS | Average mAP (IoU 0.5:0.95) | 36.8 | TriDet (I3D RGB) |
| HACS | mAP@IoU 0.5 | 54.5 | TriDet (I3D RGB) |
| HACS | mAP@IoU 0.75 | 36.8 | TriDet (I3D RGB) |
| HACS | mAP@IoU 0.95 | 11.5 | TriDet (I3D RGB) |
| ActivityNet-1.3 | Average mAP (IoU 0.5:0.95) | 36.8 | TriDet (TSP features) |
| ActivityNet-1.3 | mAP@IoU 0.5 | 54.7 | TriDet (TSP features) |
| ActivityNet-1.3 | mAP@IoU 0.75 | 38.0 | TriDet (TSP features) |
| ActivityNet-1.3 | mAP@IoU 0.95 | 8.4 | TriDet (TSP features) |
| THUMOS’14 | Average mAP (IoU 0.3:0.7) | 69.3 | TriDet (I3D features) |
| THUMOS’14 | mAP@IoU 0.3 | 83.6 | TriDet (I3D features) |
| THUMOS’14 | mAP@IoU 0.4 | 80.1 | TriDet (I3D features) |
| THUMOS’14 | mAP@IoU 0.5 | 72.9 | TriDet (I3D features) |
| THUMOS’14 | mAP@IoU 0.6 | 62.4 | TriDet (I3D features) |
| THUMOS’14 | mAP@IoU 0.7 | 47.4 | TriDet (I3D features) |
| EPIC-KITCHENS-100 | Average mAP (IoU 0.1:0.5) | 25.4 | TriDet (verb) |
| EPIC-KITCHENS-100 | mAP@IoU 0.1 | 28.6 | TriDet (verb) |
| EPIC-KITCHENS-100 | mAP@IoU 0.2 | 27.4 | TriDet (verb) |
| EPIC-KITCHENS-100 | mAP@IoU 0.3 | 26.1 | TriDet (verb) |
| EPIC-KITCHENS-100 | mAP@IoU 0.4 | 24.2 | TriDet (verb) |
| EPIC-KITCHENS-100 | mAP@IoU 0.5 | 20.8 | TriDet (verb) |
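For THUMOS’14 and EPIC-KITCHENS-100, the average mAP is the mean of the per-threshold mAP values listed in the table, so it can be checked directly. The short sketch below does that arithmetic using only numbers copied from the table; the helper name `average_map` is ours. It does not apply to HACS or ActivityNet-1.3, whose average is taken over ten IoU thresholds (0.5 to 0.95 in steps of 0.05) of which only three are listed.

```python
def average_map(per_threshold_map):
    """Mean of per-IoU-threshold mAP values (all in percent)."""
    return sum(per_threshold_map.values()) / len(per_threshold_map)


# Per-threshold mAP values copied from the table (IoU threshold -> mAP %).
thumos14 = {0.3: 83.6, 0.4: 80.1, 0.5: 72.9, 0.6: 62.4, 0.7: 47.4}
epic_verb = {0.1: 28.6, 0.2: 27.4, 0.3: 26.1, 0.4: 24.2, 0.5: 20.8}

print(f"THUMOS'14 average mAP: {average_map(thumos14):.1f}")
print(f"EPIC-KITCHENS-100 (verb) average mAP: {average_map(epic_verb):.1f}")
```

Both means (69.28 and 25.42) round to the reported 69.3 and 25.4.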