Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, LiMin Wang
Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting can be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances in a multi-label untrimmed video. Multi-label TAD is more challenging because it requires fine-grained class discrimination within a single video and precise localization of co-occurring instances. To address these challenges, we extend the sparse query-based detection paradigm from traditional TAD and propose a multi-label TAD framework, PointTAD. Specifically, PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism for localizing the discriminative frames at the boundaries as well as the important frames inside the action. Moreover, we perform action decoding with the Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, PointTAD is an end-to-end trainable framework that relies only on RGB input, which eases deployment. We evaluate our method on two popular benchmarks and introduce a new metric, detection-mAP, for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric and also achieves promising results under the segmentation-mAP metric. Code is available at https://github.com/MCG-NJU/PointTAD.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MultiTHUMOS | Average mAP | 23.5 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.1 | 42.3 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.2 | 39.7 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.3 | 35.8 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.4 | 30.9 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.5 | 24.9 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.6 | 18.5 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.7 | 12 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.8 | 5.6 | PointTAD |
| Video | MultiTHUMOS | mAP IOU@0.9 | 1.4 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | Average mAP | 23.5 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.1 | 42.3 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.2 | 39.7 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.3 | 35.8 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.4 | 30.9 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.5 | 24.9 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.6 | 18.5 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.7 | 12 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.8 | 5.6 | PointTAD |
| Temporal Action Localization | MultiTHUMOS | mAP IOU@0.9 | 1.4 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | Average mAP | 23.5 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.1 | 42.3 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.2 | 39.7 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.3 | 35.8 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.4 | 30.9 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.5 | 24.9 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.6 | 18.5 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.7 | 12 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.8 | 5.6 | PointTAD |
| Zero-Shot Learning | MultiTHUMOS | mAP IOU@0.9 | 1.4 | PointTAD |
| Action Localization | MultiTHUMOS | Average mAP | 23.5 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.1 | 42.3 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.2 | 39.7 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.3 | 35.8 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.4 | 30.9 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.5 | 24.9 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.6 | 18.5 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.7 | 12 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.8 | 5.6 | PointTAD |
| Action Localization | MultiTHUMOS | mAP IOU@0.9 | 1.4 | PointTAD |
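
The table reports mAP at temporal IoU thresholds from 0.1 to 0.9 together with an Average mAP. As a minimal sketch, the snippet below shows how temporal IoU between two segments is commonly computed and checks the reported average against the per-threshold rows. The function name `temporal_iou` and the dictionary `map_at_iou` are illustrative, not taken from the PointTAD code base, and the snippet assumes that Average mAP is the arithmetic mean over the nine listed thresholds; the table does not state this, but its numbers are consistent with that reading.

```python
from statistics import mean


def temporal_iou(pred, gt):
    """Temporal IoU of two segments, each given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Per-threshold mAP values copied from the MultiTHUMOS rows above.
map_at_iou = {
    0.1: 42.3, 0.2: 39.7, 0.3: 35.8,
    0.4: 30.9, 0.5: 24.9, 0.6: 18.5,
    0.7: 12.0, 0.8: 5.6, 0.9: 1.4,
}

# Assumption: Average mAP is the plain mean over these nine thresholds.
average_map = mean(list(map_at_iou.values()))
print(round(average_map, 1))
```

Under this assumption the mean rounds to the 23.5 listed in the Average mAP rows. A prediction counts as a true positive at a given threshold only when its temporal IoU with a ground-truth instance of the same class reaches that threshold, which is why the scores drop as the threshold tightens.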