Towards Good Practices for Very Deep Two-Stream ConvNets

Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao

2015-07-08

Data Augmentation, Action Recognition, Action Recognition In Videos, Temporal Action Localization, Vocal Bursts Valence Prediction


Abstract

Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement brought by deep convolutional networks is less evident. We argue that two reasons could explain this result. First, the current network architectures (e.g., two-stream ConvNets) are relatively shallow compared with the very deep models in the image domain (e.g., VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, so it is easy to over-fit on the training data. To address these issues, this report presents very deep two-stream ConvNets for action recognition, obtained by adapting recent very deep architectures to the video domain. However, this extension is not easy because action recognition datasets are quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, and (iv) a high dropout ratio. Meanwhile, we extend the Caffe toolbox with a multi-GPU implementation that has high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the UCF101 dataset, where they achieve a recognition accuracy of $91.4\%$.
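The abstract names "more data augmentation techniques" without saying which ones. As an illustration only, the sketch below assumes two common choices for small video datasets: cropping from the four corners and the center of a frame, and jittering the crop size. The function names, the frame size of 256x340, and the candidate crop sizes are assumptions made for this example, not values stated on this page.

```python
import random


def corner_crop_offsets(img_h, img_w, crop_h, crop_w):
    """Return (top, left) offsets for the four corners and the center crop."""
    if crop_h > img_h or crop_w > img_w:
        raise ValueError("crop must fit inside the image")
    max_top = img_h - crop_h
    max_left = img_w - crop_w
    return [
        (0, 0),                          # top-left
        (0, max_left),                   # top-right
        (max_top, 0),                    # bottom-left
        (max_top, max_left),             # bottom-right
        (max_top // 2, max_left // 2),   # center
    ]


def sample_crop(img_h, img_w, sizes=(256, 224, 192, 168), rng=random):
    """Sample one augmented crop: jittered size, corner/center position, flip.

    The candidate sizes are assumed values chosen for illustration.
    """
    crop_h = rng.choice(sizes)   # height and width are drawn independently,
    crop_w = rng.choice(sizes)   # which also jitters the aspect ratio
    top, left = rng.choice(corner_crop_offsets(img_h, img_w, crop_h, crop_w))
    flip = rng.random() < 0.5    # horizontal flip with probability 0.5
    return top, left, crop_h, crop_w, flip
```

A caller would cut the sampled window out of each frame (or optical-flow stack), resize it to the network input size, and mirror it when `flip` is true. Restricting positions to corners and center is one way to push the network to look at image borders as well as the middle; whether that matches the report's exact recipe should be checked against the full text.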

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	UCF101	3-fold Accuracy	91.4	Very deep two-stream ConvNet
Action Recognition	UCF101	3-fold Accuracy	91.4	Very deep two-stream ConvNet
