Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Generating Videos with Scene Dynamics

Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

2016-09-08 · NeurIPS 2016

Tasks: Action Classification, Representation Learning, Video Recognition, Future Prediction, General Classification, Video Understanding, Video Generation, Self-Supervised Action Recognition

Paper · PDF

Abstract

We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.
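The abstract's key architectural idea is a generator that untangles foreground from background: one stream produces a moving foreground and a soft mask, the other a single static background frame, and the output video is their mask-weighted composite. A minimal numpy sketch of that compositing step (the function name, shapes, and toy data here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def composite_video(foreground, mask, background):
    """Combine the two generator streams described in the paper:
    video(t) = mask(t) * foreground(t) + (1 - mask(t)) * background.

    foreground: (T, H, W, C) moving-object stream
    mask:       (T, H, W, 1) values in [0, 1], marking dynamic regions
    background: (H, W, C)    single static frame, broadcast over time
    """
    assert mask.min() >= 0.0 and mask.max() <= 1.0
    # Broadcasting replicates the static background across all T frames.
    return mask * foreground + (1.0 - mask) * background[np.newaxis]

# Toy example at the paper's output resolution: a 16-frame 64x64 RGB clip.
T, H, W, C = 16, 64, 64, 3
fg = np.random.rand(T, H, W, C)
m = np.random.rand(T, H, W, 1)
bg = np.random.rand(H, W, C)
video = composite_video(fg, m, bg)
print(video.shape)  # (16, 64, 64, 3)
```

Where the mask saturates to 1 the frame comes entirely from the foreground stream; where it is 0 the static background shows through, which is what lets the model keep backgrounds stable while animating objects.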

Results

Task | Dataset | Metric | Value | Model
Video Generation | UCF-101 (16 frames, Unconditional, Single GPU) | Inception Score | 8.18 | VGAN
Video Generation | UCF-101 (16 frames, 64x64, Unconditional) | Inception Score | 8.18 | VGAN
Activity Recognition | UCF101 | 3-fold Accuracy | 52.1 | VideoGan (C3D)
Action Recognition | UCF101 | 3-fold Accuracy | 52.1 | VideoGan (C3D)

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)