End-to-end Temporal Action Detection with Transformer

Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, Xiang Bai

2021-06-18

Action Detection, Video Understanding, Temporal Action Localization

Abstract

Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
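The abstract describes the temporal deformable attention module only as attending to "a sparse set of key snippets". The sketch below is an illustration of that idea and not the authors' implementation. It assumes a single attention head, no value projection, and linear interpolation between neighbouring snippets. The sampling offsets and attention logits are passed in directly, whereas in a real model they would be predicted from the query. All function and parameter names here are hypothetical.

```python
import math


def linear_sample(features, pos):
    """Linearly interpolate a sequence of feature vectors at fractional index pos."""
    t = len(features)
    pos = min(max(pos, 0.0), t - 1.0)  # clamp to the valid index range
    lo = int(math.floor(pos))
    hi = min(lo + 1, t - 1)
    w = pos - lo
    return [(1.0 - w) * a + w * b for a, b in zip(features[lo], features[hi])]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_deformable_attention(features, reference, offsets, scores):
    """Aggregate a few sampled snippets around a reference point.

    features:  list of T snippet feature vectors (lists of floats)
    reference: normalized reference position in [0, 1]
    offsets:   K sampling offsets in snippet units (assumed given here)
    scores:    K unnormalized attention logits (assumed given here)
    """
    weights = softmax(scores)
    center = reference * (len(features) - 1)
    out = [0.0] * len(features[0])
    for off, w in zip(offsets, weights):
        sampled = linear_sample(features, center + off)
        out = [o + w * s for o, s in zip(out, sampled)]
    return out


# Toy input: 8 snippets whose first channel equals the snippet index.
features = [[float(i), 1.0] for i in range(8)]
context = temporal_deformable_attention(
    features, reference=0.5, offsets=[-1.0, 0.0, 1.0], scores=[0.0, 0.0, 0.0]
)
print(context)
```

Each query touches only K sampled positions, so the cost of this step depends on K and not on the video length T. That is the property the abstract's phrase "sparse set of key snippets" points to.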

Results

Task	Dataset	Metric	Value	Model
Video	HACS	Average-mAP	32.09	TadTR (I3D RGB)
Video	HACS	mAP@0.5	47.14	TadTR (I3D RGB)
Video	HACS	mAP@0.75	32.11	TadTR (I3D RGB)
Video	HACS	mAP@0.95	10.94	TadTR (I3D RGB)
Video	ActivityNet-1.3	mAP	36.75	TadTR (TSP features)
Video	ActivityNet-1.3	mAP IOU@0.5	53.62	TadTR (TSP features)
Video	ActivityNet-1.3	mAP IOU@0.75	37.52	TadTR (TSP features)
Video	ActivityNet-1.3	mAP IOU@0.95	10.56	TadTR (TSP features)
Video	THUMOS’14	Avg mAP (0.3:0.7)	56.7	TadTR
Video	THUMOS’14	mAP IOU@0.3	74.8	TadTR
Video	THUMOS’14	mAP IOU@0.4	69.1	TadTR
Video	THUMOS’14	mAP IOU@0.5	60.1	TadTR
Video	THUMOS’14	mAP IOU@0.6	46.6	TadTR
Video	THUMOS’14	mAP IOU@0.7	32.8	TadTR
Temporal Action Localization	HACS	Average-mAP	32.09	TadTR (I3D RGB)
Temporal Action Localization	HACS	mAP@0.5	47.14	TadTR (I3D RGB)
Temporal Action Localization	HACS	mAP@0.75	32.11	TadTR (I3D RGB)
Temporal Action Localization	HACS	mAP@0.95	10.94	TadTR (I3D RGB)
Temporal Action Localization	ActivityNet-1.3	mAP	36.75	TadTR (TSP features)
Temporal Action Localization	ActivityNet-1.3	mAP IOU@0.5	53.62	TadTR (TSP features)
Temporal Action Localization	ActivityNet-1.3	mAP IOU@0.75	37.52	TadTR (TSP features)
Temporal Action Localization	ActivityNet-1.3	mAP IOU@0.95	10.56	TadTR (TSP features)
Temporal Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	56.7	TadTR
Temporal Action Localization	THUMOS’14	mAP IOU@0.3	74.8	TadTR
Temporal Action Localization	THUMOS’14	mAP IOU@0.4	69.1	TadTR
Temporal Action Localization	THUMOS’14	mAP IOU@0.5	60.1	TadTR
Temporal Action Localization	THUMOS’14	mAP IOU@0.6	46.6	TadTR
Temporal Action Localization	THUMOS’14	mAP IOU@0.7	32.8	TadTR
Zero-Shot Learning	HACS	Average-mAP	32.09	TadTR (I3D RGB)
Zero-Shot Learning	HACS	mAP@0.5	47.14	TadTR (I3D RGB)
Zero-Shot Learning	HACS	mAP@0.75	32.11	TadTR (I3D RGB)
Zero-Shot Learning	HACS	mAP@0.95	10.94	TadTR (I3D RGB)
Zero-Shot Learning	ActivityNet-1.3	mAP	36.75	TadTR (TSP features)
Zero-Shot Learning	ActivityNet-1.3	mAP IOU@0.5	53.62	TadTR (TSP features)
Zero-Shot Learning	ActivityNet-1.3	mAP IOU@0.75	37.52	TadTR (TSP features)
Zero-Shot Learning	ActivityNet-1.3	mAP IOU@0.95	10.56	TadTR (TSP features)
Zero-Shot Learning	THUMOS’14	Avg mAP (0.3:0.7)	56.7	TadTR
Zero-Shot Learning	THUMOS’14	mAP IOU@0.3	74.8	TadTR
Zero-Shot Learning	THUMOS’14	mAP IOU@0.4	69.1	TadTR
Zero-Shot Learning	THUMOS’14	mAP IOU@0.5	60.1	TadTR
Zero-Shot Learning	THUMOS’14	mAP IOU@0.6	46.6	TadTR
Zero-Shot Learning	THUMOS’14	mAP IOU@0.7	32.8	TadTR
Action Localization	HACS	Average-mAP	32.09	TadTR (I3D RGB)
Action Localization	HACS	mAP@0.5	47.14	TadTR (I3D RGB)
Action Localization	HACS	mAP@0.75	32.11	TadTR (I3D RGB)
Action Localization	HACS	mAP@0.95	10.94	TadTR (I3D RGB)
Action Localization	ActivityNet-1.3	mAP	36.75	TadTR (TSP features)
Action Localization	ActivityNet-1.3	mAP IOU@0.5	53.62	TadTR (TSP features)
Action Localization	ActivityNet-1.3	mAP IOU@0.75	37.52	TadTR (TSP features)
Action Localization	ActivityNet-1.3	mAP IOU@0.95	10.56	TadTR (TSP features)
Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	56.7	TadTR
Action Localization	THUMOS’14	mAP IOU@0.3	74.8	TadTR
Action Localization	THUMOS’14	mAP IOU@0.4	69.1	TadTR
Action Localization	THUMOS’14	mAP IOU@0.5	60.1	TadTR
Action Localization	THUMOS’14	mAP IOU@0.6	46.6	TadTR
Action Localization	THUMOS’14	mAP IOU@0.7	32.8	TadTR
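The metrics in the table are mean average precision (mAP) values at temporal IoU thresholds. As a minimal sketch, the snippet below shows how the temporal IoU between a predicted and a ground-truth segment is usually computed, and checks that the THUMOS’14 average reported above matches the mean of its five per-threshold values. The function name and the example segments are hypothetical; the mAP numbers are copied from the table.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    s1, e1 = pred
    s2, e2 = gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical segments: 2 s of overlap out of a 6 s union.
iou = temporal_iou((2.0, 6.0), (4.0, 8.0))

# THUMOS'14 mAP at each tIoU threshold, taken from the table above.
thumos14_map = {0.3: 74.8, 0.4: 69.1, 0.5: 60.1, 0.6: 46.6, 0.7: 32.8}
average_map = sum(thumos14_map.values()) / len(thumos14_map)

print(round(iou, 3), round(average_map, 1))
```

The HACS and ActivityNet-1.3 averages cannot be reproduced this way from the table alone, since only three of their per-threshold values are listed.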
