Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, Wei Liu

2021-03-10 · CVPR 2021
Tasks: Self-Supervised Action Recognition Linear · Representation Learning · Contrastive Learning · Action Recognition
Links: Paper · PDF · Code (official)

Abstract

MoCo is effective for unsupervised image representation learning. In this paper, we propose VideoMoCo for unsupervised video representation learning. Given a video sequence as an input sample, we improve the temporal feature representations of MoCo from two perspectives. First, we introduce a generator that temporally drops out several frames from this sample. The discriminator then learns to encode similar feature representations regardless of frame removals. By adaptively dropping out different frames across training iterations of adversarial learning, we augment this input sample to train a temporally robust encoder. Second, we use temporal decay to model key attenuation in the memory queue when computing the contrastive loss. Because the momentum encoder updates after keys are enqueued, the representation ability of these keys degrades by the time the current input sample is used for contrastive learning. This degradation is reflected via temporal decay, which attends the input sample to recent keys in the queue. As a result, we adapt MoCo to learn video representations without empirically designing pretext tasks. By strengthening the temporal robustness of the encoder and modeling the temporal decay of the keys, our VideoMoCo improves MoCo temporally based on contrastive learning. Experiments on benchmark datasets including UCF101 and HMDB51 show that VideoMoCo stands as a state-of-the-art video representation learning method.
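The two mechanisms in the abstract — temporal frame dropout and decay-weighted keys in the contrastive loss — can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: the random (rather than adversarially learned) frame selection, the exponential decay schedule, and the names and hyperparameters (`drop_frames`, `decay_weighted_infonce`, `decay`, `tau`) are all assumptions for illustration.

```python
import numpy as np

def drop_frames(frames, n_drop, rng):
    """Temporal augmentation: remove n_drop frames, keeping temporal order.
    (In VideoMoCo a learned generator picks which frames to drop; random
    choice here is a simplification.)"""
    keep = np.sort(rng.choice(len(frames), size=len(frames) - n_drop, replace=False))
    return [frames[i] for i in keep]

def decay_weighted_infonce(q, k_pos, queue, decay=0.99, tau=0.07):
    """InfoNCE-style contrastive loss in which older keys in the memory
    queue are attenuated. queue is ordered newest-first; the exponential
    schedule is an assumed stand-in for the paper's temporal decay."""
    pos = np.exp(q @ k_pos / tau)
    neg = np.exp(queue @ q / tau)        # one logit per queued key
    w = decay ** np.arange(len(queue))   # newest key weight ~1, older keys fade
    return -np.log(pos / (pos + np.sum(w * neg)))
```

Dropping different frames at each iteration pushes the encoder to produce similar features for degraded clips, while the decay weights shift the loss toward the most recent (least stale) keys in the queue.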

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 49.2 | R[2+1]D (VideoMoCo)
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 43.6 | 3D-ResNet-18 (VideoMoCo)
Activity Recognition | UCF101 | 3-fold Accuracy | 78.7 | R[2+1]D (VideoMoCo)
Activity Recognition | UCF101 | 3-fold Accuracy | 74.1 | 3D-ResNet-18 (VideoMoCo)
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 49.2 | R[2+1]D (VideoMoCo)
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 43.6 | 3D-ResNet-18 (VideoMoCo)
Action Recognition | UCF101 | 3-fold Accuracy | 78.7 | R[2+1]D (VideoMoCo)
Action Recognition | UCF101 | 3-fold Accuracy | 74.1 | 3D-ResNet-18 (VideoMoCo)

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
- SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)