Sangyoun Lee, Juho Jung, Changdae Oh, Sunghee Yun
Temporal Action Localization (TAL) is a critical task in video analysis that identifies the precise start and end times of actions. Existing methods such as CNNs, RNNs, GCNs, and Transformers have limitations in capturing long-range dependencies and temporal causality. To address these challenges, we propose a novel TAL architecture leveraging the Selective State Space Model (S6). Our approach integrates the Feature Aggregated Bi-S6 block, the Dual Bi-S6 structure, and a recurrent mechanism to enhance temporal and channel-wise dependency modeling without increasing parameter complexity. Extensive experiments on benchmark datasets demonstrate state-of-the-art results, with average mAP scores of 74.2% on THUMOS-14, 42.9% on ActivityNet-1.3, 29.6% on FineAction, and 45.8% on HACS. Ablation studies validate our method's effectiveness, showing that the Dual structure in the Stem module and the recurrent mechanism outperform traditional approaches. Our findings demonstrate the potential of S6-based models in TAL tasks, paving the way for future research.
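
To make the S6 idea above more concrete, here is a minimal sketch of a selective state space scan and a simple bidirectional wrapper. It is an illustration under stated assumptions, not the paper's Feature Aggregated Bi-S6 block or Dual Bi-S6 structure: it assumes a single input channel, a diagonal state matrix `a`, and hypothetical weights `w_delta`, `w_b`, and `w_c` that make the step size and projections depend on the input. Summing a forward and a backward scan is one common way to obtain a bidirectional model and is used here only for illustration.

```python
import math


def selective_scan(x, a, w_delta, w_b, w_c):
    """Single-channel selective SSM scan (illustrative sketch only).

    x: list of floats, the input sequence.
    a: negative diagonal entries of the state matrix (length N).
    w_delta, w_b, w_c: hypothetical weights that make the step size,
    input projection and output projection depend on x[t].
    """
    h = [0.0] * len(a)  # hidden state, one entry per state dimension
    ys = []
    for x_t in x:
        # Softplus keeps the input-dependent step size positive.
        delta = math.log1p(math.exp(w_delta * x_t))
        b_t = [w * x_t for w in w_b]  # input-dependent B
        c_t = [w * x_t for w in w_c]  # input-dependent C
        # Discretised recurrence: h = exp(delta * a) * h + delta * B_t * x_t
        h = [math.exp(delta * a_i) * h_i + delta * b_i * x_t
             for a_i, h_i, b_i in zip(a, h, b_t)]
        # Readout: y_t = C_t . h
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c_t, h)))
    return ys


def bidirectional_scan(x, **params):
    """Sum of a forward scan and a scan over the reversed sequence."""
    forward = selective_scan(x, **params)
    backward = selective_scan(x[::-1], **params)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Toy parameters, chosen arbitrarily for the example.
params = dict(a=[-1.0, -0.5], w_delta=0.3, w_b=[1.0, 0.5], w_c=[0.7, -0.2])
outputs = bidirectional_scan([1.0, 2.0, 1.0], **params)
```

The forward scan is causal: each output depends only on the current and earlier inputs. The backward pass supplies context from later time steps, which matters when both the start and the end of an action have to be located.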
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | HACS | Average-mAP | 45.8 | RDFA-S6 (InternVideo2-6B) |
| Video | HACS | mAP@0.5 | 66.4 | RDFA-S6 (InternVideo2-6B) |
| Video | HACS | mAP@0.75 | 47.2 | RDFA-S6 (InternVideo2-6B) |
| Video | HACS | mAP@0.95 | 14.3 | RDFA-S6 (InternVideo2-6B) |
| Video | ActivityNet-1.3 | mAP | 42.9 | RDFA-S6 (InternVideo2-6B) |
| Video | ActivityNet-1.3 | mAP IOU@0.5 | 64.1 | RDFA-S6 (InternVideo2-6B) |
| Video | ActivityNet-1.3 | mAP IOU@0.75 | 44.0 | RDFA-S6 (InternVideo2-6B) |
| Video | ActivityNet-1.3 | mAP IOU@0.95 | 10.6 | RDFA-S6 (InternVideo2-6B) |
| Video | FineAction | mAP | 29.6 | RDFA-S6 (InternVideo2-6B) |
| Video | FineAction | mAP IOU@0.5 | 46.4 | RDFA-S6 (InternVideo2-6B) |
| Video | FineAction | mAP IOU@0.75 | 29.5 | RDFA-S6 (InternVideo2-6B) |
| Video | FineAction | mAP IOU@0.95 | 7.6 | RDFA-S6 (InternVideo2-6B) |
| Video | THUMOS’14 | Avg mAP (0.3:0.7) | 74.2 | RDFA-S6 (InternVideo2-6B) |
| Video | THUMOS’14 | mAP IOU@0.3 | 88.7 | RDFA-S6 (InternVideo2-6B) |
| Video | THUMOS’14 | mAP IOU@0.4 | 84.6 | RDFA-S6 (InternVideo2-6B) |
| Video | THUMOS’14 | mAP IOU@0.5 | 78.2 | RDFA-S6 (InternVideo2-6B) |
| Video | THUMOS’14 | mAP IOU@0.6 | 66.6 | RDFA-S6 (InternVideo2-6B) |
| Video | THUMOS’14 | mAP IOU@0.7 | 51.9 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | HACS | Average-mAP | 45.8 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | HACS | mAP@0.5 | 66.4 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | HACS | mAP@0.75 | 47.2 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | HACS | mAP@0.95 | 14.3 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | ActivityNet-1.3 | mAP | 42.9 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 64.1 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 44.0 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 10.6 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | FineAction | mAP | 29.6 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | FineAction | mAP IOU@0.5 | 46.4 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | FineAction | mAP IOU@0.75 | 29.5 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | FineAction | mAP IOU@0.95 | 7.6 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 74.2 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.3 | 88.7 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.4 | 84.6 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.5 | 78.2 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.6 | 66.6 | RDFA-S6 (InternVideo2-6B) |
| Temporal Action Localization | THUMOS’14 | mAP IOU@0.7 | 51.9 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | HACS | Average-mAP | 45.8 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | HACS | mAP@0.5 | 66.4 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | HACS | mAP@0.75 | 47.2 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | HACS | mAP@0.95 | 14.3 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP | 42.9 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.5 | 64.1 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.75 | 44.0 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.95 | 10.6 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | FineAction | mAP | 29.6 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | FineAction | mAP IOU@0.5 | 46.4 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | FineAction | mAP IOU@0.75 | 29.5 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | FineAction | mAP IOU@0.95 | 7.6 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | THUMOS’14 | Avg mAP (0.3:0.7) | 74.2 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.3 | 88.7 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.4 | 84.6 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.5 | 78.2 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.6 | 66.6 | RDFA-S6 (InternVideo2-6B) |
| Zero-Shot Learning | THUMOS’14 | mAP IOU@0.7 | 51.9 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | HACS | Average-mAP | 45.8 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | HACS | mAP@0.5 | 66.4 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | HACS | mAP@0.75 | 47.2 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | HACS | mAP@0.95 | 14.3 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | ActivityNet-1.3 | mAP | 42.9 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 64.1 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 44.0 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 10.6 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | FineAction | mAP | 29.6 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | FineAction | mAP IOU@0.5 | 46.4 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | FineAction | mAP IOU@0.75 | 29.5 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | FineAction | mAP IOU@0.95 | 7.6 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 74.2 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | THUMOS’14 | mAP IOU@0.3 | 88.7 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | THUMOS’14 | mAP IOU@0.4 | 84.6 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | THUMOS’14 | mAP IOU@0.5 | 78.2 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | THUMOS’14 | mAP IOU@0.6 | 66.6 | RDFA-S6 (InternVideo2-6B) |
| Action Localization | THUMOS’14 | mAP IOU@0.7 | 51.9 | RDFA-S6 (InternVideo2-6B) |
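
The metrics in the table are mAP values at temporal IoU thresholds (for example 0.3 to 0.7 on THUMOS'14). As a minimal sketch of the matching criterion behind such thresholds, the snippet below computes the temporal IoU between a predicted segment and a ground-truth segment and checks it against several thresholds. The function names and the example segments are hypothetical, and this is not the official evaluation code of any of the benchmarks, which additionally rank predictions by confidence and compute average precision per class.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at(pred, gt, thresholds):
    """Map each IoU threshold to whether the prediction counts as a match."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


# Hypothetical example: the prediction overlaps the ground truth by 4 s
# out of a combined 8 s span, so the temporal IoU is 0.5.
prediction = (2.0, 8.0)
ground_truth = (4.0, 10.0)
result = matches_at(prediction, ground_truth, [0.3, 0.4, 0.5, 0.6, 0.7])
# The prediction matches at thresholds 0.3, 0.4 and 0.5 but not at 0.6 or 0.7.
```

Stricter thresholds demand tighter boundaries, which is consistent with the reported scores dropping as the IoU threshold rises.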