Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video Pixel Networks

Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu

Published: 2016-10-03 · ICML 2017 · Task: Video Prediction
Links: Paper · PDF · Code

Abstract

We propose a probabilistic video model, the Video Pixel Network (VPN), that estimates the discrete joint distribution of the raw pixel values in a video. The model and the neural architecture reflect the time, space and color structure of video tensors and encode it as a four-dimensional dependency chain. The VPN approaches the best possible performance on the Moving MNIST benchmark, a leap over the previous state of the art, and the generated videos show only minor deviations from the ground truth. The VPN also produces detailed samples on the action-conditional Robotic Pushing benchmark and generalizes to the motion of novel objects.
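A minimal sketch (not the authors' code) of the four-dimensional dependency chain the abstract describes: the joint distribution over a discrete video tensor factorizes autoregressively over time, height, width, and color channel, with each pixel value modeled as a 256-way softmax conditioned on everything earlier in the ordering. The `context_logits` callable here is a hypothetical stand-in for the VPN's convolutional encoder and PixelCNN-style decoder.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def joint_log_prob(video, context_logits, n_values=256):
    """Log-probability of a discrete video tensor (T, H, W, C) under an
    autoregressive factorization over time, space, and color channels --
    the four-dimensional dependency chain described in the abstract."""
    total = 0.0
    T, H, W, C = video.shape
    for t in range(T):
        for y in range(H):
            for x in range(W):
                for c in range(C):
                    # The model may only condition on values earlier
                    # in the (t, y, x, c) ordering.
                    logits = context_logits(video, t, y, x, c)  # (n_values,)
                    total += log_softmax(logits)[video[t, y, x, c]]
    return total

# Toy stand-in model: uniform logits (a real VPN conditions on past frames).
uniform = lambda video, t, y, x, c: np.zeros(256)
toy_video = np.random.randint(0, 256, size=(2, 4, 4, 3))
lp = joint_log_prob(toy_video, uniform)
# With uniform logits, log p = -N * log(256) where N = 2*4*4*3 = 96.
```

Sampling works the same way in reverse: draw each pixel value from its softmax, write it into the tensor, and move to the next position in the chain, which is why generation is sequential in these models.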

Results

Task              Dataset  Cond. frames  Pred. frames  Metric  Value  Model
Video Prediction  KTH      10            20            PSNR    23.76  VPN
Video Prediction  KTH      10            20            SSIM    0.746  VPN
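For reference, PSNR (used in the table above) is a simple function of the mean squared error between a predicted frame and the ground truth. A minimal sketch, assuming 8-bit frames; the example values below are illustrative, not taken from the paper:

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two frames, in dB (higher is
    better): PSNR = 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

# Illustration: a frame off by a constant 16 at every pixel.
target = np.full((64, 64), 128, dtype=np.uint8)
pred = target + 16
# MSE = 256, so PSNR = 10 * log10(255^2 / 256) ~ 24.05 dB.
```

SSIM, the other metric in the table, instead compares local windowed statistics (means, variances, covariances) of the two frames and lies in [-1, 1]; implementations such as `skimage.metrics.structural_similarity` are commonly used.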

Related Papers

Epona: Autoregressive Diffusion World Model for Autonomous Driving (2025-06-30)
Whole-Body Conditioned Egocentric Video Prediction (2025-06-26)
MinD: Unified Visual Imagination and Control via Hierarchical World Models (2025-06-23)
AMPLIFY: Actionless Motion Priors for Robot Learning from Videos (2025-06-17)
Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction (2025-05-30)
Autoregression-free video prediction using diffusion model for mitigating error propagation (2025-05-28)
Consistent World Models via Foresight Diffusion (2025-05-22)
Programmatic Video Prediction Using Large Language Models (2025-05-20)