
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

A Closer Look at Spatiotemporal Convolutions for Action Recognition

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri

Published: 2017-11-30
Venue: CVPR 2018 (June)
Tasks: Action Classification, Action Recognition, Temporal Action Localization

Abstract

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant advantages in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which gives rise to CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101 and HMDB51.
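The factorization described in the abstract can be illustrated with a small parameter-count sketch. It compares a full t x d x d 3D convolution with a (2+1)D pair made of a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution. The function names are hypothetical, and the rule used to pick the intermediate channel width (chosen so the factorized pair has roughly as many weights as the full 3D filter) is not stated on this page; treat it as an assumption about the paper's design rather than a quotation of it.

```python
# Sketch only: weight counts (no biases) for a full 3D convolution versus a
# (2+1)D factorization. All names here are illustrative, not from the paper's code.

def conv3d_params(c_in, c_out, t, d):
    """Weights in a single t x d x d 3D convolution."""
    return c_in * c_out * t * d * d


def mid_channels(c_in, c_out, t, d):
    """Intermediate width that keeps the factorized pair at or below the
    weight count of the full 3D convolution (assumed parameter-matching rule)."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def conv2plus1d_params(c_in, c_out, t, d, c_mid=None):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    if c_mid is None:
        c_mid = mid_channels(c_in, c_out, t, d)
    spatial = c_in * c_mid * d * d   # 2D filtering within each frame
    temporal = c_mid * c_out * t     # 1D filtering across frames
    return spatial + temporal


if __name__ == "__main__":
    full = conv3d_params(64, 64, t=3, d=3)
    mid = mid_channels(64, 64, t=3, d=3)
    factored = conv2plus1d_params(64, 64, t=3, d=3)
    print(full, mid, factored)
```

With 64 input and output channels and a 3 x 3 x 3 kernel, the two variants end up with the same number of weights, so any accuracy difference between them would come from the factorization itself (and the extra nonlinearity that can sit between the two convolutions) rather than from added capacity.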

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 75.4 | R[2+1]D-Flow (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@5 | 91.9 | R[2+1]D-Flow (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@1 | 74.3 | R[2+1]D-RGB (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@5 | 91.4 | R[2+1]D-RGB (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@1 | 73.9 | R[2+1]D-Two-Stream |
| Video | Kinetics-400 | Acc@5 | 90.9 | R[2+1]D-Two-Stream |
| Video | Kinetics-400 | Acc@1 | 72 | R[2+1]D |
| Video | Kinetics-400 | Acc@5 | 90 | R[2+1]D |
| Video | Kinetics-400 | Acc@1 | 72 | R[2+1]D-RGB |
| Video | Kinetics-400 | Acc@5 | 90 | R[2+1]D-RGB |
| Video | Kinetics-400 | Acc@1 | 67.5 | R[2+1]D-Flow |
| Video | Kinetics-400 | Acc@5 | 87.2 | R[2+1]D-Flow |
| Activity Recognition / Action Recognition | Sports-1M | Video hit@1 | 73.3 | R[2+1]D-Two-Stream-32frame |
| Activity Recognition / Action Recognition | Sports-1M | Video hit@5 | 91.9 | R[2+1]D-Two-Stream-32frame |
| Activity Recognition / Action Recognition | Sports-1M | Clip Hit@1 | 57 | R[2+1]D-RGB-32frame |
| Activity Recognition / Action Recognition | Sports-1M | Video hit@1 | 73 | R[2+1]D-RGB-32frame |
| Activity Recognition / Action Recognition | Sports-1M | Video hit@5 | 91.5 | R[2+1]D-RGB-32frame |
| Activity Recognition / Action Recognition | Sports-1M | Clip Hit@1 | 46.4 | R[2+1]D-Flow-32frame |
| Activity Recognition / Action Recognition | Sports-1M | Video hit@1 | 68.4 | R[2+1]D-Flow-32frame |
| Activity Recognition / Action Recognition | Sports-1M | Video hit@5 | 88.7 | R[2+1]D-Flow-32frame |
| Activity Recognition / Action Recognition | HMDB-51 | Average accuracy of 3 splits | 78.7 | R[2+1]D-TwoStream (Kinetics pretrained) |
| Activity Recognition / Action Recognition | HMDB-51 | Average accuracy of 3 splits | 76.4 | R[2+1]D-Flow (Kinetics pretrained) |
| Activity Recognition / Action Recognition | HMDB-51 | Average accuracy of 3 splits | 74.5 | R[2+1]D-RGB (Kinetics pretrained) |
| Activity Recognition / Action Recognition | HMDB-51 | Average accuracy of 3 splits | 72.7 | R[2+1]D-TwoStream (Sports1M pretrained) |
| Activity Recognition / Action Recognition | HMDB-51 | Average accuracy of 3 splits | 70.1 | R[2+1]D-Flow (Sports1M pretrained) |
| Activity Recognition / Action Recognition | HMDB-51 | Average accuracy of 3 splits | 66.6 | R[2+1]D-RGB (Sports1M pretrained) |
| Activity Recognition / Action Recognition | UCF101 | 3-fold Accuracy | 97.3 | R[2+1]D-TwoStream (Kinetics pretrained) |
| Activity Recognition / Action Recognition | UCF101 | 3-fold Accuracy | 96.8 | R[2+1]D-RGB (Kinetics pretrained) |
| Activity Recognition / Action Recognition | UCF101 | 3-fold Accuracy | 95.5 | R[2+1]D-Flow (Kinetics pretrained) |
| Activity Recognition / Action Recognition | UCF101 | 3-fold Accuracy | 95 | R[2+1]D-TwoStream (Sports-1M pretrained) |
| Activity Recognition / Action Recognition | UCF101 | 3-fold Accuracy | 93.6 | R[2+1]D-RGB (Sports-1M pretrained) |
| Activity Recognition / Action Recognition | UCF101 | 3-fold Accuracy | 93.3 | R[2+1]D-Flow (Sports-1M pretrained) |
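
The results above mix top-1 and top-5 metrics (Acc@1, Acc@5) with clip-level and video-level hit rates. As a rough sketch of how such numbers relate, the snippet below scores a prediction as a hit when the true label is among the k highest-scoring classes, and forms a video-level prediction by averaging per-clip class scores. The averaging step and all function names are assumptions made for illustration; this page does not specify the evaluation protocol.

```python
# Sketch only: top-k hit rate, plus clip-to-video aggregation by score averaging.
# Function names and the averaging rule are illustrative assumptions.

def top_k_hit(scores, label, k):
    """True if `label` is among the k classes with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


def video_scores(clip_scores):
    """Average per-clip class scores into one score vector for the whole video."""
    n_clips = len(clip_scores)
    return [sum(column) / n_clips for column in zip(*clip_scores)]


def accuracy_at_k(all_scores, labels, k):
    """Percentage of examples whose true label is in the top k predictions."""
    hits = sum(top_k_hit(s, y, k) for s, y in zip(all_scores, labels))
    return 100.0 * hits / len(labels)


if __name__ == "__main__":
    # Three clips from one video with true class 1 (toy scores over 3 classes).
    clips = [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4], [0.2, 0.5, 0.3]]
    print("clip hit@1:", accuracy_at_k(clips, [1, 1, 1], k=1))
    print("video hit@1:", top_k_hit(video_scores(clips), 1, k=1))
```

In this toy case one of the three clips is misclassified on its own, yet the averaged video-level prediction is correct, which is one way a video-level hit rate can exceed the clip-level rate for the same model.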
