Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CCVS: Context-aware Controllable Video Synthesis

Guillaume Le Moing, Jean Ponce, Cordelia Schmid

2021-07-16 · NeurIPS 2021
Tasks: Optical Flow Estimation, Video Prediction, Self-Supervised Learning, Video Generation
Links: Paper, PDF, Code (official)

Abstract

This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: it conditions the synthesis process on contextual information for temporal continuity and on ancillary information for fine control. The prediction model is doubly autoregressive: in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space) and for taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.
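The doubly autoregressive loop described above can be sketched in miniature: encode context frames to latents, quantize them to discrete tokens, let a forecaster predict future tokens, then decode each predicted token back to image space so the new frame joins the context for the next step. The sketch below uses NumPy stand-ins for the encoder, quantizer, transformer, and decoder; all shapes, names, and the trivial "copy-last-token" forecaster are illustrative assumptions, not the paper's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook standing in for the learned quantizer's embeddings
# (16 discrete codes, 8-dimensional latents; sizes are arbitrary).
CODEBOOK = rng.normal(size=(16, 8))

def encode(frame):
    """Encoder stub: project a flattened 4x4 frame to an 8-dim latent."""
    W = np.ones((16, 8)) / 16.0  # fixed projection, in place of a trained CNN
    return frame.reshape(-1) @ W

def quantize(z):
    """Nearest-neighbour lookup into the codebook (the inserted quantizer)."""
    idx = int(np.argmin(((CODEBOOK - z) ** 2).sum(axis=1)))
    return idx, CODEBOOK[idx]

def forecast(token_history):
    """Transformer stub: predict the next token id from the token history.

    A trivial 'repeat the last token' rule replaces the actual transformer.
    """
    return token_history[-1]

def decode(z_q):
    """Decoder stub: map an 8-dim quantized latent back to a 4x4 frame."""
    W = np.ones((8, 16))
    return (z_q @ W).reshape(4, 4)

# Doubly autoregressive rollout: forecast in latent (token) space, then
# decode to image space so each new frame updates the context.
context = [rng.normal(size=(4, 4)) for _ in range(2)]   # 2 conditioning frames
tokens = [quantize(encode(f))[0] for f in context]
for _ in range(3):                                      # predict 3 new frames
    nxt = forecast(tokens)
    tokens.append(nxt)
    context.append(decode(CODEBOOK[nxt]))

print(len(context))  # 2 context frames + 3 predicted frames = 5
```

Sampling the forecaster stochastically (rather than the deterministic rule above) is what would yield the multiple futures the abstract mentions; the discrete token interface is also where ancillary inputs such as audio or trajectories could be injected as extra conditioning tokens.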

Results

Task             | Dataset                          | Metric | Value | Model
Video Prediction | Kinetics-600 (12 frames, 64x64)  | Cond   | 5     | CCVS
Video Prediction | Kinetics-600 (12 frames, 64x64)  | Pred   | 11    | CCVS
Video Generation | BAIR Robot Pushing               | Cond   | 1     | CCVS
Video Generation | BAIR Robot Pushing               | Pred   | 15    | CCVS
Video Generation | BAIR Robot Pushing               | Train  | 15    | CCVS

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)