Chenlin Zhang, Jianxin Wu, Yin Li
Self-attention-based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer -- a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements over prior work. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior work). Our code is available at http://github.com/happyharrycn/actionformer_release.
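To make the idea of classifying every moment and estimating its action boundaries concrete, the following is a minimal sketch and not the released implementation. It assumes each moment outputs a class score plus two non-negative distances to the action start and end; the names `decode_segments`, `temporal_iou`, `stride`, and `threshold` are hypothetical and chosen for illustration only. The second function computes the temporal IoU (tIoU) used to match predictions to ground truth.

```python
from typing import List, Tuple

Segment = Tuple[float, float]


def decode_segments(
    scores: List[float],
    dist_start: List[float],
    dist_end: List[float],
    stride: float = 1.0,
    threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Turn per-moment outputs into (start, end, score) candidates.

    Assumption: moment t sits at time t * stride and predicts how far
    the action extends to its left (dist_start) and right (dist_end).
    """
    segments = []
    for t, (score, d_s, d_e) in enumerate(zip(scores, dist_start, dist_end)):
        if score >= threshold:  # keep only confident moments
            center = t * stride
            segments.append((center - d_s, center + d_e, score))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Three moments; only the middle one is confident.
    candidates = decode_segments(
        scores=[0.1, 0.9, 0.2],
        dist_start=[0.0, 1.0, 0.0],
        dist_end=[0.0, 2.0, 0.0],
    )
    print(candidates)
    # Overlap of the candidate with a hypothetical ground-truth segment.
    print(temporal_iou(candidates[0][:2], (1.0, 4.0)))
```

A prediction counts as correct at tIoU=0.5 when its overlap with a ground-truth segment of the same class reaches at least 0.5. A full pipeline would also suppress overlapping candidates, which this sketch omits.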
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features) |
| Video | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features) |
| Video | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features) |
| Video | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features) |
| Video | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features) |
| Video | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features) |
| Video | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features) |
| Video | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features) |
| Video | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features) |
| Video | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features) |
| Video | EPIC-KITCHENS-100 | Avg mAP (0.1:0.5) | 23.5 | ActionFormer (verb) |
| Video | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb) |
| Video | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb) |
| Video | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb) |
| Video | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb) |
| Video | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb) |
| Temporal Action Localization | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features) |
| Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features) |
| Temporal Action Localization | EPIC-KITCHENS-100 | Avg mAP (0.1:0.5) | 23.5 | ActionFormer (verb) |
| Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb) |
| Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb) |
| Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb) |
| Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb) |
| Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features) |
| Zero-Shot Learning | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features) |
| Zero-Shot Learning | EPIC-KITCHENS-100 | Avg mAP (0.1:0.5) | 23.5 | ActionFormer (verb) |
| Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb) |
| Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb) |
| Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb) |
| Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb) |
| Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb) |
| audio-visual event localization | UnAV-100 | mAP | 42.2 | ActionFormer |
| audio-visual event localization | UnAV-100 | AP@IOU0.5 | 43.5 | ActionFormer |
| Action Localization | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features) |
| Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features) |
| Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features) |
| Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features) |
| Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features) |
| Action Localization | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features) |
| Action Localization | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features) |
| Action Localization | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features) |
| Action Localization | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features) |
| Action Localization | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features) |
| Action Localization | EPIC-KITCHENS-100 | Avg mAP (0.1:0.5) | 23.5 | ActionFormer (verb) |
| Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb) |
| Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb) |
| Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb) |
| Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb) |
| Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb) |
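
The averaged rows in the table can be checked against the per-threshold rows. The short sketch below assumes "Avg mAP" is the plain arithmetic mean of the mAP values at the listed tIoU thresholds; the helper name `average_map` is hypothetical. The numbers are copied from the THUMOS’14 and EPIC-KITCHENS-100 (verb) rows above.

```python
from typing import Dict


def average_map(map_per_tiou: Dict[float, float]) -> float:
    """Arithmetic mean of mAP values over a set of tIoU thresholds."""
    return sum(map_per_tiou.values()) / len(map_per_tiou)


# mAP (%) at each tIoU threshold, taken from the table.
thumos14 = {0.3: 82.1, 0.4: 77.8, 0.5: 71.0, 0.6: 59.4, 0.7: 43.9}
epic_verb = {0.1: 26.6, 0.2: 25.4, 0.3: 24.2, 0.4: 22.3, 0.5: 19.1}

if __name__ == "__main__":
    print(round(average_map(thumos14), 1))   # 66.8
    print(round(average_map(epic_verb), 1))  # 23.5
```

Both results match the reported averages. The ActivityNet-1.3 average (36.6) cannot be reproduced the same way from this table, since only three of its thresholds are listed here.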