Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Two-Stream Convolutional Networks for Action Recognition in Videos

Karen Simonyan, Andrew Zisserman

2014-06-09 · NeurIPS 2014 12

Tasks: Action Classification, Optical Flow Estimation, Multi-Task Learning, Video Classification, General Classification, Action Recognition, Action Recognition In Videos, Temporal Action Localization, Vocal Bursts Valence Prediction

Abstract

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
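
The two-stream idea in the abstract can be illustrated with a small late-fusion sketch. The abstract does not say how the spatial and temporal predictions are combined, so the weighted averaging of softmax scores below is an assumption made for illustration, and the names `softmax`, `fuse_two_stream`, `predict_class`, and `temporal_weight` are hypothetical. The logits stand in for the outputs of two already-trained networks; no network is defined here.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the two streams' per-class softmax scores.

    Assumption: averaging is used as the fusion rule; the abstract
    does not specify one.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= temporal_weight <= 1.0:
        raise ValueError("temporal_weight must lie in [0, 1]")
    spatial = softmax(spatial_logits)    # appearance from still frames
    temporal = softmax(temporal_logits)  # motion from optical flow
    return [
        (1.0 - temporal_weight) * s + temporal_weight * t
        for s, t in zip(spatial, temporal)
    ]


def predict_class(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Return the index of the class with the highest fused score."""
    fused = fuse_two_stream(spatial_logits, temporal_logits, temporal_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, a confident temporal stream can override a less confident spatial stream, which is the sense in which the two sources of information are complementary. Setting `temporal_weight` to 0 or 1 reduces the prediction to a single stream.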

Results

Task | Dataset | Metric | Value | Model
Video | Charades | MAP | 18.6 | 2-Strm
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 59.4 | Two-Stream (ImageNet pretrained)
Activity Recognition | UCF101 | 3-fold Accuracy | 88 | Two-Stream (ImageNet pretrained)
Hand | VIVA Hand Gestures Dataset | Accuracy | 68 | Two Stream CNNs
Gesture Recognition | VIVA Hand Gestures Dataset | Accuracy | 68 | Two Stream CNNs
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 59.4 | Two-Stream (ImageNet pretrained)
Action Recognition | UCF101 | 3-fold Accuracy | 88 | Two-Stream (ImageNet pretrained)

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Robust-Multi-Task Gradient Boosting (2025-07-15)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation (2025-07-10)
Learning to Track Any Points from Human Motion (2025-07-08)