Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


D3D: Distilled 3D Networks for Video Action Recognition

Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar

Published: 2018-12-19
Tasks: Action Classification · Optical Flow Estimation · Action Recognition · Temporal Action Localization
Links: Paper · PDF · Code

Abstract

State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the temporal filters should allow the spatial stream to learn motion representations, making the temporal stream redundant. However, we still see significant benefits in action recognition performance by including an entirely separate temporal stream, indicating that the spatial stream is "missing" some of the signal captured by the temporal stream. In this work, we first investigate whether motion representations are indeed missing in the spatial stream of 3D CNNs. Second, we demonstrate that these motion representations can be improved by distillation, by tuning the spatial stream to predict the outputs of the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with two-stream approaches, using only a single model and with no need to compute optical flow.
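The core idea in the abstract — tuning the spatial (RGB) stream to predict the outputs of the temporal (optical-flow) stream — can be sketched as a standard distillation objective: a classification loss plus an auxiliary term matching the RGB stream's motion prediction to the frozen flow stream's output. The following NumPy sketch is illustrative only, not the authors' implementation; all names (`distillation_loss`, the weight `lam`, the toy tensors) are assumptions for demonstration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true class.
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def distillation_loss(rgb_logits, labels, rgb_motion_pred, flow_target, lam=1.0):
    """Hypothetical combined objective: action-classification loss on the
    RGB stream, plus an MSE term pushing the RGB stream's predicted motion
    representation toward the (frozen) flow stream's output. `lam` is an
    assumed trade-off weight, not a value from the paper."""
    cls = cross_entropy(rgb_logits, labels)
    distill = np.mean((rgb_motion_pred - flow_target) ** 2)
    return cls + lam * distill

# Toy batch: 4 clips, 10 action classes, 16-dim motion representation.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = np.array([1, 3, 5, 7])
motion_pred = rng.normal(size=(4, 16))   # RGB stream's motion prediction
flow_target = rng.normal(size=(4, 16))   # frozen flow stream's output
loss = distillation_loss(logits, labels, motion_pred, flow_target)
print(float(loss))
```

At inference time only the distilled RGB stream is kept, which is what lets D3D match two-stream accuracy without computing optical flow.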

Results

Task                 | Dataset      | Metric                        | Value | Model
Video                | Kinetics-400 | Acc@1                         | 76.5  | D3D+S3D-G (RGB + RGB)
Video                | Kinetics-400 | Acc@1                         | 75.9  | D3D (RGB)
Video                | Kinetics-600 | Top-1 Accuracy                | 79.1  | D3D+S3D-G
Video                | Kinetics-600 | Top-1 Accuracy                | 77.9  | D3D
Activity Recognition | HMDB-51      | Average accuracy of 3 splits  | 80.5  | D3D + D3D
Activity Recognition | HMDB-51      | Average accuracy of 3 splits  | 79.3  | D3D (Kinetics-600 pretraining)
Activity Recognition | HMDB-51      | Average accuracy of 3 splits  | 78.7  | D3D (Kinetics-400 pretraining)
Activity Recognition | AVA v2.1     | mAP (Val)                     | 23    | D3D (ResNet RPN, Kinetics-400 pretraining)
Activity Recognition | UCF101       | 3-fold Accuracy               | 97.6  | D3D + D3D
Activity Recognition | UCF101       | 3-fold Accuracy               | 97.1  | D3D (Kinetics-600 pretraining)
Activity Recognition | UCF101       | 3-fold Accuracy               | 97    | D3D (Kinetics-400 pretraining)
Action Recognition   | HMDB-51      | Average accuracy of 3 splits  | 80.5  | D3D + D3D
Action Recognition   | HMDB-51      | Average accuracy of 3 splits  | 79.3  | D3D (Kinetics-600 pretraining)
Action Recognition   | HMDB-51      | Average accuracy of 3 splits  | 78.7  | D3D (Kinetics-400 pretraining)
Action Recognition   | AVA v2.1     | mAP (Val)                     | 23    | D3D (ResNet RPN, Kinetics-400 pretraining)
Action Recognition   | UCF101       | 3-fold Accuracy               | 97.6  | D3D + D3D
Action Recognition   | UCF101       | 3-fold Accuracy               | 97.1  | D3D (Kinetics-600 pretraining)
Action Recognition   | UCF101       | 3-fold Accuracy               | 97    | D3D (Kinetics-400 pretraining)

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
Learning to Track Any Points from Human Motion (2025-07-08)
TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (2025-07-07)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation (2025-06-29)