ActionFlowNet: Learning Motion Representation for Action Recognition

Joe Yue-Hei Ng, Jonghyun Choi, Jan Neumann, Larry S. Davis

2016-12-09 | Optical Flow Estimation, Action Recognition, Temporal Action Localization

Abstract

Despite recent advances of convolutional neural networks (CNNs) in various visual recognition tasks, state-of-the-art action recognition systems still rely on hand-crafted motion features such as optical flow to achieve the best performance. We propose ActionFlowNet, a multitask learning model that trains a single-stream network directly from raw pixels to jointly estimate optical flow and recognize actions, capturing both appearance and motion in a single model. We additionally provide insights into how the quality of the learned optical flow affects action recognition. Our model improves action recognition accuracy by a large margin (31%) compared to state-of-the-art CNN-based action recognition models trained without external large-scale data or additional optical flow input. Without pretraining on large external labeled datasets, our model, by effectively exploiting motion information, achieves recognition accuracy competitive with models trained on large labeled datasets such as ImageNet and Sports-1M.
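The joint objective described in the abstract can be illustrated with a minimal sketch. It assumes the two tasks are combined as a weighted sum of a softmax cross-entropy term for the action label and a mean endpoint-error term for the flow. The abstract does not state the loss, so the function names, the choice of endpoint error, and the `flow_weight` parameter are all hypothetical and only show the general shape of a multitask loss.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one clip; logits is a list of class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def mean_endpoint_error(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and reference (u, v) vectors."""
    dists = [math.hypot(pu - tu, pv - tv)
             for (pu, pv), (tu, tv) in zip(pred_flow, true_flow)]
    return sum(dists) / len(dists)

def multitask_loss(logits, label, pred_flow, true_flow, flow_weight=1.0):
    """Weighted sum of both task losses; flow_weight is a hypothetical knob."""
    action_term = cross_entropy(logits, label)
    flow_term = mean_endpoint_error(pred_flow, true_flow)
    return action_term + flow_weight * flow_term
```

In a real network both terms would be computed from the outputs of a shared backbone, so gradients from the flow term also shape the features used for classification. Setting `flow_weight` to 0 reduces the sketch to a plain classification loss.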

Results

Task	Dataset	Metric	Value (%)	Model
Activity Recognition	HMDB-51	Average accuracy of 3 splits	56.4	ActionFlowNet
Activity Recognition	UCF101	3-fold Accuracy	83.9	ActionFlowNet
Action Recognition	HMDB-51	Average accuracy of 3 splits	56.4	ActionFlowNet
Action Recognition	UCF101	3-fold Accuracy	83.9	ActionFlowNet
