Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri

2014-12-02 | ICCV 2015 | Action Recognition, Action Recognition In Videos, Dynamic Facial Expression Recognition

Abstract

We propose a simple yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets; 2) a homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best-performing architectures for 3D ConvNets; and 3) our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with the current best methods on the other 2 benchmarks. In addition, the features are compact, achieving 52.8% accuracy on the UCF101 dataset with only 10 dimensions, and are very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
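To make the "small 3x3x3 convolution kernels" point concrete, the following is a minimal sketch of the standard output-size and parameter-count arithmetic for a single 3D convolution layer. It is not the authors' implementation. The clip size (16 frames of 112x112 pixels), the channel counts (3 in, 64 out), and the stride and padding of 1 are assumptions chosen for illustration and are not stated in the abstract; the function names are hypothetical.

```python
def conv3d_output_shape(shape, kernel=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1)):
    """Output (T, H, W) of a 3D convolution, using the usual size formula
    floor((n + 2p - k) / s) + 1 applied independently to each axis."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )


def conv3d_param_count(c_in, c_out, kernel=(3, 3, 3)):
    """Number of weights plus one bias per output channel."""
    kt, kh, kw = kernel
    return kt * kh * kw * c_in * c_out + c_out


# Assumed input clip: 16 frames of 112x112 pixels (illustrative values only).
clip_shape = (16, 112, 112)

# With stride 1 and padding 1, a 3x3x3 kernel preserves all three dimensions,
# so the temporal axis survives the layer instead of being collapsed.
out_shape = conv3d_output_shape(clip_shape)

# Assumed channel counts: 3 input channels (RGB), 64 output channels.
params_3d = conv3d_param_count(3, 64)                    # 3x3x3 kernel
params_2d = conv3d_param_count(3, 64, kernel=(1, 3, 3))  # 2D-like 1x3x3 kernel

print(out_shape, params_3d, params_2d)
```

Under these assumptions the 3x3x3 layer has three times as many weights as its 1x3x3 counterpart, which is the cost of also convolving across time.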

Results

Task	Dataset	Metric	Value (%)	Model
Activity Recognition	Sports-1M	Clip Hit@1	46.1	C3D
Activity Recognition	Sports-1M	Video Hit@1	61.1	C3D
Activity Recognition	Sports-1M	Video Hit@5	85.5	C3D
Activity Recognition	HMDB-51	Average accuracy of 3 splits	51.6	C3D
Activity Recognition	UCF101	3-fold Accuracy	82.3	C3D
Action Recognition	Sports-1M	Clip Hit@1	46.1	C3D
Action Recognition	Sports-1M	Video Hit@1	61.1	C3D
Action Recognition	Sports-1M	Video Hit@5	85.5	C3D
Action Recognition	HMDB-51	Average accuracy of 3 splits	51.6	C3D
Action Recognition	UCF101	3-fold Accuracy	82.3	C3D
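
The Hit@k metrics in the table count a prediction as correct when the true class is among the k highest-scoring classes. The sketch below shows one common way to compute clip-level and video-level Hit@k. It assumes that video-level scores are obtained by averaging per-clip class scores; the page itself does not spell out the aggregation rule. The toy scores and function names are hypothetical.

```python
def hit_at_k(scores, label, k):
    """Return True if `label` is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def video_scores(clip_scores):
    """Average the per-class scores over all clips of one video
    (assumed aggregation rule, used here for illustration)."""
    n = len(clip_scores)
    return [sum(column) / n for column in zip(*clip_scores)]


# Toy data: one video split into three clips, four classes, true class 2.
clips = [
    [0.1, 0.5, 0.3, 0.1],  # this clip's top prediction is class 1 (a miss)
    [0.1, 0.2, 0.6, 0.1],  # top prediction is class 2 (a hit)
    [0.2, 0.1, 0.6, 0.1],  # top prediction is class 2 (a hit)
]
label = 2

# Clip Hit@1: fraction of individual clips whose top class is correct.
clip_hit1 = sum(hit_at_k(c, label, 1) for c in clips) / len(clips)

# Video Hit@1: a single decision per video after averaging clip scores.
video_hit1 = hit_at_k(video_scores(clips), label, 1)

print(clip_hit1, video_hit1)
```

In this toy case one clip is misclassified on its own, yet the averaged video-level prediction is correct, which illustrates why video-level Hit@1 can exceed clip-level Hit@1.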

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)