
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Vikram Voleti, Alexia Jolicoeur-Martineau, Christopher Pal

2022-05-19 | Denoising, Video Prediction, Prediction, Video Generation

Abstract

Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using $\le$ 4 GPUs. Project page: https://mask-cond-video-diffusion.github.io; Code: https://github.com/voletiv/mcvd-pytorch
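The masking scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 0.5 masking probabilities, the zero-fill masking convention, and the toy representation of a frame as a flat list of floats are all assumptions made for illustration. Only the mapping from mask pattern to task is taken from the abstract.

```python
import random

def sample_masks(rng, p_mask_past=0.5, p_mask_future=0.5):
    """Independently decide whether to mask ALL past and ALL future frames.

    The probabilities are hypothetical; the abstract does not state them.
    """
    mask_past = rng.random() < p_mask_past
    mask_future = rng.random() < p_mask_future
    return mask_past, mask_future

def task_for(mask_past, mask_future):
    """Map a mask pattern to the task named in the abstract."""
    if mask_past and mask_future:
        return "unconditional generation"
    if mask_future:
        return "future prediction"   # only future frames are masked
    if mask_past:
        return "past prediction"     # only past frames are masked
    return "interpolation"           # neither block is masked

def apply_masks(past, future, mask_past, mask_future):
    """Zero out a masked conditioning block (assumed masking convention)."""
    def zeros_like(frames):
        return [[0.0] * len(frame) for frame in frames]
    past_out = zeros_like(past) if mask_past else past
    future_out = zeros_like(future) if mask_future else future
    return past_out, future_out

if __name__ == "__main__":
    rng = random.Random(0)
    past = [[0.1, 0.2], [0.3, 0.4]]      # two toy "frames"
    future = [[0.9, 0.8], [0.7, 0.6]]
    for _ in range(4):
        mp, mf = sample_masks(rng)
        cond_past, cond_future = apply_masks(past, future, mp, mf)
        print(task_for(mp, mf), cond_past, cond_future)
```

Because the two masks are drawn independently, a single training run sees all four conditioning patterns, which is what would let one model serve all four tasks at sampling time.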

Results

Task | Dataset | Metric | Value | Model
Video | UCF-101 | FVD16 | 1143 | MCVD (64x64)
Video | BAIR Robot Pushing | Cond | 2 | MCVD : c2t5p14
Video | BAIR Robot Pushing | FVD score | 87.9 | MCVD : c2t5p14
Video | BAIR Robot Pushing | PSNR | 19.1 | MCVD : c2t5p14
Video | BAIR Robot Pushing | Pred | 14 | MCVD : c2t5p14
Video | BAIR Robot Pushing | SSIM | 0.838 | MCVD : c2t5p14
Video | BAIR Robot Pushing | Train | 5 | MCVD : c2t5p14
Video | BAIR Robot Pushing | Cond | 1 | MCVD : c1t5p15
Video | BAIR Robot Pushing | FVD score | 89.5 | MCVD : c1t5p15
Video | BAIR Robot Pushing | PSNR | 16.9 | MCVD : c1t5p15
Video | BAIR Robot Pushing | Pred | 15 | MCVD : c1t5p15
Video | BAIR Robot Pushing | SSIM | 0.78 | MCVD : c1t5p15
Video | BAIR Robot Pushing | Train | 5 | MCVD : c1t5p15
Video | BAIR Robot Pushing | Cond | 2 | MCVD : c2t5p28
Video | BAIR Robot Pushing | FVD score | 118.4 | MCVD : c2t5p28
Video | BAIR Robot Pushing | PSNR | 16.2 | MCVD : c2t5p28
Video | BAIR Robot Pushing | Pred | 28 | MCVD : c2t5p28
Video | BAIR Robot Pushing | SSIM | 0.745 | MCVD : c2t5p28
Video | BAIR Robot Pushing | Train | 5 | MCVD : c2t5p28
Video Generation | UCF-101 | FVD16 | 1143 | MCVD (64x64)
Video Generation | BAIR Robot Pushing | Cond | 2 | MCVD : c2t5p14
Video Generation | BAIR Robot Pushing | FVD score | 87.9 | MCVD : c2t5p14
Video Generation | BAIR Robot Pushing | PSNR | 19.1 | MCVD : c2t5p14
Video Generation | BAIR Robot Pushing | Pred | 14 | MCVD : c2t5p14
Video Generation | BAIR Robot Pushing | SSIM | 0.838 | MCVD : c2t5p14
Video Generation | BAIR Robot Pushing | Train | 5 | MCVD : c2t5p14
Video Generation | BAIR Robot Pushing | Cond | 1 | MCVD : c1t5p15
Video Generation | BAIR Robot Pushing | FVD score | 89.5 | MCVD : c1t5p15
Video Generation | BAIR Robot Pushing | PSNR | 16.9 | MCVD : c1t5p15
Video Generation | BAIR Robot Pushing | Pred | 15 | MCVD : c1t5p15
Video Generation | BAIR Robot Pushing | SSIM | 0.78 | MCVD : c1t5p15
Video Generation | BAIR Robot Pushing | Train | 5 | MCVD : c1t5p15
Video Generation | BAIR Robot Pushing | Cond | 2 | MCVD : c2t5p28
Video Generation | BAIR Robot Pushing | FVD score | 118.4 | MCVD : c2t5p28
Video Generation | BAIR Robot Pushing | PSNR | 16.2 | MCVD : c2t5p28
Video Generation | BAIR Robot Pushing | Pred | 28 | MCVD : c2t5p28
Video Generation | BAIR Robot Pushing | SSIM | 0.745 | MCVD : c2t5p28
Video Generation | BAIR Robot Pushing | Train | 5 | MCVD : c2t5p28
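
The model names in the results appear to encode the Cond, Train, and Pred values listed beside them (for example, c2t5p14 has Cond 2, Train 5, Pred 14). The sketch below parses such a name and runs a block-wise autoregressive rollout of the kind the abstract describes. It rests on assumptions: that Train is the number of frames generated per block, that each new block is conditioned on the most recent Cond frames, and that `generate_block` stands in for the diffusion sampler. All function names are hypothetical, and frames are represented as plain integers purely for illustration.

```python
import math
import re

def parse_config(name):
    """Parse a name like 'c2t5p14' into its cond/train/pred integers."""
    match = re.fullmatch(r"c(\d+)t(\d+)p(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognised config name: {name!r}")
    cond, train, pred = (int(g) for g in match.groups())
    return {"cond": cond, "train": train, "pred": pred}

def num_blocks(cfg):
    """Blocks needed to cover 'pred' frames when each block has 'train' frames."""
    return math.ceil(cfg["pred"] / cfg["train"])

def rollout(context, cfg, generate_block):
    """Generate cfg['pred'] frames block by block.

    Each block is conditioned on the latest cfg['cond'] frames, which
    includes previously generated frames after the first step.
    """
    frames = list(context)
    produced = []
    while len(produced) < cfg["pred"]:
        cond_frames = frames[-cfg["cond"]:]
        block = generate_block(cond_frames, cfg["train"])
        produced.extend(block)
        frames.extend(block)
    return produced[:cfg["pred"]]   # drop any overshoot from the last block

def dummy_generate_block(cond_frames, n):
    """Placeholder for a learned sampler: continues an integer sequence."""
    last = cond_frames[-1]
    return [last + i + 1 for i in range(n)]

if __name__ == "__main__":
    cfg = parse_config("c2t5p14")
    print(cfg, "blocks:", num_blocks(cfg))
    print(rollout([0, 1], cfg, dummy_generate_block))
```

Under these assumptions, c2t5p14 needs three blocks of five frames, with the last frame of the third block discarded to reach 14 predicted frames.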

Related Papers

Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)