Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video Representation Learning by Dense Predictive Coding

Tengda Han, Weidi Xie, Andrew Zisserman

2019-09-10
Self-Supervised Action Recognition Linear, Representation Learning, Self-Supervised Learning, Action Recognition, Temporal Action Localization, Self-Supervised Action Recognition

Abstract

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions. First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations. Second, we propose a curriculum training scheme to predict further into the future with progressively less temporal context. This encourages the model to encode only slowly varying spatio-temporal signals, therefore leading to semantic representations. Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then fine-tuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC pre-trained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pre-trained on ImageNet.
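
The idea of recurrently predicting future representations can be illustrated with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: each video block is reduced to a single vector (the paper describes a dense encoding per spatio-temporal block), the recurrent unit is a plain tanh RNN, and the contrastive loss is assumed here because the abstract does not spell out the training objective. The names `predict_future`, `contrastive_loss`, `W_h`, `W_z`, and `W_p` are hypothetical.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def predict_future(z_context, n_pred, W_h, W_z, W_p):
    """Recurrently predict n_pred future block embeddings.

    z_context: (T, D) embeddings of the observed blocks.
    Returns an (n_pred, D) array of predicted embeddings.
    """
    h = np.zeros(W_h.shape[0])
    # Aggregate the observed blocks into a context state.
    for z in z_context:
        h = np.tanh(W_h @ h + W_z @ z)
    preds = []
    for _ in range(n_pred):
        z_hat = W_p @ h                      # predict the next block embedding
        preds.append(z_hat)
        h = np.tanh(W_h @ h + W_z @ z_hat)   # feed the prediction back in
    return np.stack(preds)

def contrastive_loss(preds, targets):
    """Assumed objective: each prediction should match its own target
    better than the targets of the other time steps."""
    scores = preds @ targets.T               # (n_pred, n_pred) dot products
    return -np.mean(np.diag(log_softmax(scores, axis=1)))

# Toy data: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
D, H, T, n_pred = 8, 16, 5, 3
W_h = rng.normal(scale=0.1, size=(H, H))
W_z = rng.normal(scale=0.1, size=(H, D))
W_p = rng.normal(scale=0.1, size=(D, H))
z_all = rng.normal(size=(T + n_pred, D))

preds = predict_future(z_all[:T], n_pred, W_h, W_z, W_p)
loss = contrastive_loss(preds, z_all[T:])
print(preds.shape, float(loss))
```

In this toy setting, the curriculum described in the abstract would roughly correspond to decreasing `T` and increasing `n_pred` as training progresses, so that the model predicts further ahead from less context.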

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | UCF101 | 3-fold Accuracy | 75.7 | DPC (Modified 3D ResNet-34) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 68.2 | DPC (3D ResNet-18) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 60.6 | DPC (3D ResNet-18, Split 1) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 35.7 | DPC (Modified 3D ResNet-34) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 34.5 | DPC (Modified 3D ResNet-18) |
| Action Recognition | UCF101 | 3-fold Accuracy | 75.7 | DPC (Modified 3D ResNet-34) |
| Action Recognition | UCF101 | 3-fold Accuracy | 68.2 | DPC (3D ResNet-18) |
| Action Recognition | UCF101 | 3-fold Accuracy | 60.6 | DPC (3D ResNet-18, Split 1) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 35.7 | DPC (Modified 3D ResNet-34) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 34.5 | DPC (Modified 3D ResNet-18) |

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)