Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

2021-06-21 · Action Classification · Contrastive Learning · Video Understanding · Action Recognition

Paper · PDF · Code (official)

Abstract

Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We pre-train our model on uncurated videos and show that our pre-trained model can reach state-of-the-art results on several video understanding datasets (e.g., SSV2, Diving48). Lastly, we provide detailed analyses on model scalability and pre-training method design. Code is released at https://github.com/airsplay/vimpac.
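The block-wise masking idea from the abstract — masking contiguous spatio-temporal blocks of tokens rather than individual tokens, so the model cannot trivially copy from highly correlated neighbors — can be sketched as below. This is a minimal illustration, not the paper's implementation (see the linked repo for that); the function name `block_wise_mask`, the block-size bounds, and the 50% masking ratio are illustrative assumptions.

```python
import random

def block_wise_mask(t, h, w, mask_ratio=0.5, max_block=(2, 4, 4), seed=None):
    """Sample a block-wise mask over a (t, h, w) grid of video tokens.

    Repeatedly draws random spatio-temporal blocks and marks all tokens
    inside them as masked, until roughly `mask_ratio` of the grid is
    covered. Block-size bounds `max_block` are illustrative, not the
    paper's settings.
    """
    rng = random.Random(seed)
    masked = set()
    target = int(mask_ratio * t * h * w)
    while len(masked) < target:
        # Random block extent in each dimension (time, height, width).
        bt = rng.randint(1, max_block[0])
        bh = rng.randint(1, max_block[1])
        bw = rng.randint(1, max_block[2])
        # Random corner such that the block fits inside the grid.
        t0 = rng.randint(0, t - bt)
        h0 = rng.randint(0, h - bh)
        w0 = rng.randint(0, w - bw)
        for ti in range(t0, t0 + bt):
            for hi in range(h0, h0 + bh):
                for wi in range(w0, w0 + bw):
                    masked.add((ti, hi, wi))
    return masked

# Example: mask about half of a 4x8x8 token grid.
mask = block_wise_mask(4, 8, 8, mask_ratio=0.5, seed=0)
```

Because each draw covers a contiguous block in both space and time, a masked token's immediate neighbors are usually masked too, which is what makes the prediction task non-trivial for correlated video tokens.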

Results

Task                 | Dataset                | Metric                       | Value | Model
Video                | Kinetics-400           | Acc@1                        | 77.4  | VIMPAC
Activity Recognition | Diving-48              | Accuracy                     | 85.5  | VIMPAC
Activity Recognition | HMDB-51                | Average accuracy of 3 splits | 65.9  | VIMPAC
Activity Recognition | Something-Something V2 | Top-1 Accuracy               | 68.1  | VIMPAC
Activity Recognition | UCF101                 | 3-fold Accuracy              | 92.7  | VIMPAC
Action Recognition   | Diving-48              | Accuracy                     | 85.5  | VIMPAC
Action Recognition   | HMDB-51                | Average accuracy of 3 splits | 65.9  | VIMPAC
Action Recognition   | Something-Something V2 | Top-1 Accuracy               | 68.1  | VIMPAC
Action Recognition   | UCF101                 | 3-fold Accuracy              | 92.7  | VIMPAC

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
LLM-Driven Dual-Level Multi-Interest Modeling for Recommendation (2025-07-15)