Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu

2021-06-18

Video Retrieval, Representation Learning, Optical Flow Estimation, Video Recognition, Self-Supervised Learning, Data Augmentation, Contrastive Learning, Video Classification, Action Recognition, Retrieval, Action Recognition In Videos, Self-supervised Video Retrieval, Self-Supervised Action Recognition

Abstract

Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. However, they are not suitable for exploiting the rich dynamical structure of video, as operations are done on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting" (ViCC), a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process: while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion information are learned, without the explicit need for optical flow computation during inference. We obtain state-of-the-art results on nearest-neighbour video retrieval and action recognition, outperforming the previous best by +3.2% on UCF101 using the S3D backbone (90.5% Top-1 acc), and by +7.2% on UCF101 and +15.1% on HMDB51 using the R(2+1)D backbone.
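The cross-view prediction step described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say how assignments are computed, so the Sinkhorn-style balancing step, the temperature values, and all function names here (`sinkhorn`, `cross_view_loss`, and so on) are assumptions made for illustration. The random arrays stand in for embeddings of RGB and optical flow views, all scored against one set of prototype vectors.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Assumed assignment step: turn (B, K) scores into soft assignments
    that spread the batch roughly evenly over the K prototypes."""
    q = np.exp((scores - scores.max()) / eps).T  # (K, B)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # equalize mass per prototype
        q /= k
        q /= q.sum(axis=0, keepdims=True)  # equalize mass per sample
        q /= b
    return (q * b).T  # (B, K); each row sums to 1

def cross_view_loss(views, prototypes, temperature=0.1):
    """Predict each view's assignment from every OTHER view.

    views: list of (B, D) embeddings (hypothetical RGB and flow views).
    prototypes: (K, D) prototype vectors of the stream being optimized.
    """
    protos = l2_normalize(prototypes)
    scores = [l2_normalize(v) @ protos.T for v in views]  # each (B, K)
    targets = [sinkhorn(s) for s in scores]               # assignments
    log_probs = [log_softmax(s / temperature) for s in scores]
    total, n_terms = 0.0, 0
    for i, q in enumerate(targets):
        for j, log_p in enumerate(log_probs):
            if i == j:
                continue  # skip the view that produced the assignment
            total += -(q * log_p).sum(axis=1).mean()  # cross-entropy
            n_terms += 1
    return total / n_terms

# Toy usage: 4 views (e.g. 2 RGB + 2 flow), batch of 8, 16-dim, 5 prototypes.
rng = np.random.default_rng(0)
views = [rng.normal(size=(8, 16)) for _ in range(4)]
prototypes = rng.normal(size=(5, 16))
loss = cross_view_loss(views, prototypes)
```

Minimizing a loss of this shape pulls each view's prototype scores toward the assignments computed from the other views, which is one way to read "pushing representations closer to their assigned prototypes". The alternation between streams mentioned in the abstract would sit in an outer training loop and is not shown here.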

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Activity Recognition | UCF101 (finetuned) | 3-fold Accuracy | 90.5 | ViCC (S3D; R+F) |
| Activity Recognition | UCF101 (finetuned) | 3-fold Accuracy | 88.8 | ViCC (R2+1D; R+F) |
| Activity Recognition | UCF101 (finetuned) | 3-fold Accuracy | 84.3 | ViCC (S3D; RGB) |
| Activity Recognition | UCF101 (finetuned) | 3-fold Accuracy | 82.8 | ViCC (R2+1D; RGB) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 90.5 | ViCC (S3D; R+F) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 88.8 | ViCC (S3D; RGB) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 88.8 | ViCC (R2+1D; R+F) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 82.8 | ViCC (R2+1D; RGB) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 72.2 | ViCC (S3D; RGB) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 62.2 | ViCC (S3D; R+F) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 61.5 | ViCC (R2+1D; R+F) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 52.4 | ViCC (R2+1D; RGB) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 38.5 | ViCC (S3D; RGB) |
| Activity Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 62.2 | ViCC (S3D; R+F) |
| Activity Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 61.5 | ViCC (R2+1D; R+F) |
| Activity Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 52.4 | ViCC (R2+1D; RGB) |
| Activity Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 47.9 | ViCC (S3D; RGB) |
| Action Recognition | UCF101 (finetuned) | 3-fold Accuracy | 90.5 | ViCC (S3D; R+F) |
| Action Recognition | UCF101 (finetuned) | 3-fold Accuracy | 88.8 | ViCC (R2+1D; R+F) |
| Action Recognition | UCF101 (finetuned) | 3-fold Accuracy | 84.3 | ViCC (S3D; RGB) |
| Action Recognition | UCF101 (finetuned) | 3-fold Accuracy | 82.8 | ViCC (R2+1D; RGB) |
| Action Recognition | UCF101 | 3-fold Accuracy | 90.5 | ViCC (S3D; R+F) |
| Action Recognition | UCF101 | 3-fold Accuracy | 88.8 | ViCC (S3D; RGB) |
| Action Recognition | UCF101 | 3-fold Accuracy | 88.8 | ViCC (R2+1D; R+F) |
| Action Recognition | UCF101 | 3-fold Accuracy | 82.8 | ViCC (R2+1D; RGB) |
| Action Recognition | UCF101 | 3-fold Accuracy | 72.2 | ViCC (S3D; RGB) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 62.2 | ViCC (S3D; R+F) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 61.5 | ViCC (R2+1D; R+F) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 52.4 | ViCC (R2+1D; RGB) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 38.5 | ViCC (S3D; RGB) |
| Action Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 62.2 | ViCC (S3D; R+F) |
| Action Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 61.5 | ViCC (R2+1D; R+F) |
| Action Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 52.4 | ViCC (R2+1D; RGB) |
| Action Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 47.9 | ViCC (S3D; RGB) |
