Two-Stream Convolutional Networks for Action Recognition in Videos

Karen Simonyan, Andrew Zisserman

2014-06-09NeurIPS 2014 12Action Classification Optical Flow Estimation Multi-Task Learning Video Classification General Classification Action Recognition Action Recognition In Videos Temporal Action Localization Vocal Bursts Valence Prediction

Paper PDF Code(official)Code Code Code Code Code Code

Abstract

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

Results

Task	Dataset	Metric	Value	Model
Video	Charades	MAP	18.6	2-Strm
Activity Recognition	HMDB-51	Average accuracy of 3 splits	59.4	Two-Stream (ImageNet pretrained)
Activity Recognition	UCF101	3-fold Accuracy	88	Two-Stream (ImageNet pretrained)
Hand	VIVA Hand Gestures Dataset	Accuracy	68	Two Stream CNNs
Gesture Recognition	VIVA Hand Gestures Dataset	Accuracy	68	Two Stream CNNs
Action Recognition	HMDB-51	Average accuracy of 3 splits	59.4	Two-Stream (ImageNet pretrained)
Action Recognition	UCF101	3-fold Accuracy	88	Two-Stream (ImageNet pretrained)

Two-Stream Convolutional Networks for Action Recognition in Videos

Abstract

Results

Related Papers

Two-Stream Convolutional Networks for Action Recognition in Videos

Abstract

Results

Related Papers