R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

Huijuan Xu, Abir Das, Kate Saenko

2017-03-22 | ICCV 2017 | Action Detection, Activity Detection, General Classification, Action Recognition In Videos

Abstract

We address the problem of activity detection in continuous, untrimmed video streams. This is a difficult task that requires extracting meaningful spatio-temporal features to capture activities and accurately localizing the start and end times of each activity. We introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities. Computation is saved because the proposal and classification pipelines share convolutional features. The entire model is trained end-to-end with jointly optimized localization and classification losses. R-C3D is faster than existing methods (569 frames per second on a single Titan X Maxwell GPU) and achieves state-of-the-art results on THUMOS'14. By evaluating our approach on ActivityNet and Charades, we further demonstrate that our model is a general activity detection framework that does not rely on assumptions about particular dataset properties. Our code is available at http://ai.bu.edu/r-c3d/.
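
To make "jointly optimized localization and classification losses" concrete, here is a minimal sketch of one way such a combined loss can be computed for a single candidate region. The abstract does not spell out the loss terms, so everything specific here is an assumption for illustration: softmax cross-entropy for classification, smooth L1 for the temporal offsets, the weight `lam`, the convention that label 0 means background, and all function names.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def smooth_l1(pred, target):
    """Smooth L1 distance summed over the offset coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def joint_loss(logits, label, pred_offsets, target_offsets, lam=1.0):
    """Classification loss plus weighted localization loss for one region.

    Assumption: label 0 is background, and background regions
    contribute no localization term.
    """
    cls = softmax_cross_entropy(logits, label)
    reg = smooth_l1(pred_offsets, target_offsets) if label != 0 else 0.0
    return cls + lam * reg

# One foreground region with hypothetical (center, length) offsets.
loss = joint_loss([0.0, 0.0], 1, [0.5, 2.0], [0.0, 0.0])
```

Because both terms are summed into one scalar, a single backward pass would update the shared features for both tasks, which is the sense in which the losses are optimized jointly.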

Results

Task	Dataset	Metric	Value	Model
Video	THUMOS’14	mAP IOU@0.1	54.5	R-C3D
Video	THUMOS’14	mAP IOU@0.2	51.5	R-C3D
Video	THUMOS’14	mAP IOU@0.3	44.8	R-C3D
Video	THUMOS’14	mAP IOU@0.4	35.6	R-C3D
Video	THUMOS’14	mAP IOU@0.5	28.9	R-C3D
Temporal Action Localization	THUMOS’14	mAP IOU@0.1	54.5	R-C3D
Temporal Action Localization	THUMOS’14	mAP IOU@0.2	51.5	R-C3D
Temporal Action Localization	THUMOS’14	mAP IOU@0.3	44.8	R-C3D
Temporal Action Localization	THUMOS’14	mAP IOU@0.4	35.6	R-C3D
Temporal Action Localization	THUMOS’14	mAP IOU@0.5	28.9	R-C3D
Zero-Shot Learning	THUMOS’14	mAP IOU@0.1	54.5	R-C3D
Zero-Shot Learning	THUMOS’14	mAP IOU@0.2	51.5	R-C3D
Zero-Shot Learning	THUMOS’14	mAP IOU@0.3	44.8	R-C3D
Zero-Shot Learning	THUMOS’14	mAP IOU@0.4	35.6	R-C3D
Zero-Shot Learning	THUMOS’14	mAP IOU@0.5	28.9	R-C3D
Activity Recognition	THUMOS’14	mAP@0.1	54.5	Single-stream R-C3D (two-way buffer)
Activity Recognition	THUMOS’14	mAP@0.2	51.5	Single-stream R-C3D (two-way buffer)
Activity Recognition	THUMOS’14	mAP@0.3	44.8	Single-stream R-C3D (two-way buffer)
Activity Recognition	THUMOS’14	mAP@0.4	35.6	Single-stream R-C3D (two-way buffer)
Activity Recognition	THUMOS’14	mAP@0.5	28.9	Single-stream R-C3D (two-way buffer)
Activity Recognition	THUMOS’14	mAP@0.1	51.6	Single-stream R-C3D (one-way buffer)
Activity Recognition	THUMOS’14	mAP@0.2	49.2	Single-stream R-C3D (one-way buffer)
Activity Recognition	THUMOS’14	mAP@0.3	42.8	Single-stream R-C3D (one-way buffer)
Activity Recognition	THUMOS’14	mAP@0.4	33.4	Single-stream R-C3D (one-way buffer)
Activity Recognition	THUMOS’14	mAP@0.5	27	Single-stream R-C3D (one-way buffer)
Action Localization	THUMOS’14	mAP IOU@0.1	54.5	R-C3D
Action Localization	THUMOS’14	mAP IOU@0.2	51.5	R-C3D
Action Localization	THUMOS’14	mAP IOU@0.3	44.8	R-C3D
Action Localization	THUMOS’14	mAP IOU@0.4	35.6	R-C3D
Action Localization	THUMOS’14	mAP IOU@0.5	28.9	R-C3D
Action Detection	Charades	mAP	12.4	R-C3D
Action Recognition	THUMOS’14	mAP@0.1	54.5	Single-stream R-C3D (two-way buffer)
Action Recognition	THUMOS’14	mAP@0.2	51.5	Single-stream R-C3D (two-way buffer)
Action Recognition	THUMOS’14	mAP@0.3	44.8	Single-stream R-C3D (two-way buffer)
Action Recognition	THUMOS’14	mAP@0.4	35.6	Single-stream R-C3D (two-way buffer)
Action Recognition	THUMOS’14	mAP@0.5	28.9	Single-stream R-C3D (two-way buffer)
Action Recognition	THUMOS’14	mAP@0.1	51.6	Single-stream R-C3D (one-way buffer)
Action Recognition	THUMOS’14	mAP@0.2	49.2	Single-stream R-C3D (one-way buffer)
Action Recognition	THUMOS’14	mAP@0.3	42.8	Single-stream R-C3D (one-way buffer)
Action Recognition	THUMOS’14	mAP@0.4	33.4	Single-stream R-C3D (one-way buffer)
Action Recognition	THUMOS’14	mAP@0.5	27	Single-stream R-C3D (one-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.1	54.5	Single-stream R-C3D (two-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.2	51.5	Single-stream R-C3D (two-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.3	44.8	Single-stream R-C3D (two-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.4	35.6	Single-stream R-C3D (two-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.5	28.9	Single-stream R-C3D (two-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.1	51.6	Single-stream R-C3D (one-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.2	49.2	Single-stream R-C3D (one-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.3	42.8	Single-stream R-C3D (one-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.4	33.4	Single-stream R-C3D (one-way buffer)
Action Recognition In Videos	THUMOS’14	mAP@0.5	27	Single-stream R-C3D (one-way buffer)
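
The IoU thresholds in the metric column refer to temporal intersection-over-union between a predicted segment and a ground-truth segment. As a minimal sketch (the function names and the `(start, end)` tuple representation in seconds are assumptions, not taken from the benchmark's evaluation code), a detection can be tested against a threshold like this:

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, gt, threshold):
    """True if the predicted segment overlaps the ground truth enough."""
    return temporal_iou(pred, gt) >= threshold

# A prediction shifted 2 s late against an 8 s ground-truth activity.
iou = temporal_iou((12.0, 20.0), (10.0, 18.0))  # intersection 6 s, union 10 s
```

A full mAP computation would additionally rank detections by confidence and let each ground-truth segment match at most one detection; that bookkeeping is omitted here.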
