Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira, Andrew Zisserman

Published: 2017-05-22 · CVPR 2017
Tasks: Action Classification · Skeleton Based Action Recognition · General Classification · Action Recognition · Video Object Tracking
Links: Paper · PDF · Code

Abstract

The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
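The inflation idea described in the abstract can be sketched in a few lines: a 2D filter is repeated along a new time axis and rescaled, so that a temporally constant ("boring") video yields the same activations as the source image did under the 2D filter. This is a minimal illustrative sketch, not the paper's implementation; the helper name `inflate_2d_filter` is ours.

```python
import numpy as np

def inflate_2d_filter(w2d: np.ndarray, time_dim: int) -> np.ndarray:
    """Inflate a 2D conv filter (k_h, k_w) into a 3D filter
    (time_dim, k_h, k_w) by repeating it along time and rescaling
    by 1/time_dim, preserving the response to a constant video."""
    w3d = np.repeat(w2d[np.newaxis, ...], time_dim, axis=0)
    return w3d / time_dim

# Example: a 3x3 Sobel-style filter inflated to 3x3x3.
w2d = np.array([[1., 0., -1.],
                [2., 0., -2.],
                [1., 0., -1.]])
w3d = inflate_2d_filter(w2d, time_dim=3)

# A "boring" video (the same frame repeated) produces the same
# dot-product response under the inflated filter as the 2D one.
patch = np.random.rand(3, 3)
video = np.stack([patch] * 3)
assert np.isclose((w3d * video).sum(), (w2d * patch).sum())
```

The same rescaling trick lets ImageNet-trained 2D weights bootstrap the 3D network, which is how I3D reuses successful image-classification architectures and their parameters.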

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Video | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D
Video | Charades | MAP | 32.9 | I3D
Video | Toyota Smarthome dataset | CS | 53.4 | I3D
Video | Toyota Smarthome dataset | CV1 | 34.9 | I3D
Video | Toyota Smarthome dataset | CV2 | 45.1 | I3D
Video | Kinetics-400 | Acc@1 | 71.1 | I3D
Video | Kinetics-400 | Acc@5 | 89.3 | I3D
Video | CATER | L1 | 1.2 | I3D-50 + LSTM
Video | CATER | Top 1 Accuracy | 60.2 | I3D-50 + LSTM
Video | CATER | Top 5 Accuracy | 81.8 | I3D-50 + LSTM
Temporal Action Localization | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D
Zero-Shot Learning | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 80.9 | Two-stream I3D
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 80.7 | Two-Stream I3D (Imagenet+Kinetics pre-training)
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 77.3 | Flow-I3D (Kinetics pre-training)
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 77.1 | Flow-I3D (Imagenet+Kinetics pre-training)
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 74.8 | RGB-I3D (Imagenet+Kinetics pre-training)
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 74.3 | RGB-I3D (Kinetics pre-training)
Activity Recognition | UCF101 | 3-fold Accuracy | 98 | Two-Stream I3D (Imagenet+Kinetics pre-training)
Activity Recognition | UCF101 | 3-fold Accuracy | 97.8 | Two-Stream I3D (Kinetics pre-training)
Activity Recognition | UCF101 | 3-fold Accuracy | 96.7 | Flow-I3D (Imagenet+Kinetics pre-training)
Activity Recognition | UCF101 | 3-fold Accuracy | 96.5 | Flow-I3D (Kinetics pre-training)
Activity Recognition | UCF101 | 3-fold Accuracy | 95.6 | RGB-I3D (Imagenet+Kinetics pre-training)
Activity Recognition | UCF101 | 3-fold Accuracy | 95.1 | RGB-I3D (Kinetics pre-training)
Activity Recognition | UCF101 | 3-fold Accuracy | 93.4 | Two-stream I3D
Activity Recognition | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D
Action Localization | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D
Hand | EgoGesture | Accuracy | 92.78 | I3D
Hand | VIVA Hand Gestures Dataset | Accuracy | 83.1 | I3D
Action Detection | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D
Object Tracking | CATER | L1 | 1.2 | I3D-50 + LSTM
Object Tracking | CATER | Top 1 Accuracy | 60.2 | I3D-50 + LSTM
Object Tracking | CATER | Top 5 Accuracy | 81.8 | I3D-50 + LSTM
Gesture Recognition | EgoGesture | Accuracy | 92.78 | I3D
Gesture Recognition | VIVA Hand Gestures Dataset | Accuracy | 83.1 | I3D
3D Action Recognition | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 80.9 | Two-stream I3D
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 80.7 | Two-Stream I3D (Imagenet+Kinetics pre-training)
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 77.3 | Flow-I3D (Kinetics pre-training)
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 77.1 | Flow-I3D (Imagenet+Kinetics pre-training)
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 74.8 | RGB-I3D (Imagenet+Kinetics pre-training)
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 74.3 | RGB-I3D (Kinetics pre-training)
Action Recognition | UCF101 | 3-fold Accuracy | 98 | Two-Stream I3D (Imagenet+Kinetics pre-training)
Action Recognition | UCF101 | 3-fold Accuracy | 97.8 | Two-Stream I3D (Kinetics pre-training)
Action Recognition | UCF101 | 3-fold Accuracy | 96.7 | Flow-I3D (Imagenet+Kinetics pre-training)
Action Recognition | UCF101 | 3-fold Accuracy | 96.5 | Flow-I3D (Kinetics pre-training)
Action Recognition | UCF101 | 3-fold Accuracy | 95.6 | RGB-I3D (Imagenet+Kinetics pre-training)
Action Recognition | UCF101 | 3-fold Accuracy | 95.1 | RGB-I3D (Kinetics pre-training)
Action Recognition | UCF101 | 3-fold Accuracy | 93.4 | Two-stream I3D
Action Recognition | J-HMDB | Accuracy (RGB+pose) | 84.1 | I3D

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)