MSPred: Video Prediction at Multiple Spatio-Temporal Scales with Hierarchical Recurrent Networks

Angel Villar-Corrales, Ani Karapetyan, Andreas Boltres, Sven Behnke

2022-03-17

Video Prediction, Prediction

Abstract

Autonomous systems not only need to understand their current environment, but should also be able to predict future actions conditioned on past states, for instance based on captured camera frames. However, existing models mainly focus on forecasting future video frames over short time horizons, which limits their use for long-term action planning. We propose Multi-Scale Hierarchical Prediction (MSPred), a novel video prediction model able to simultaneously forecast possible future outcomes of different levels of granularity at different spatio-temporal scales. By combining spatial and temporal downsampling, MSPred efficiently predicts abstract representations such as human poses or locations over long time horizons, while still maintaining competitive performance for video frame prediction. In our experiments, we demonstrate that MSPred accurately predicts future video frames as well as high-level representations (e.g., keypoints or semantics) on bin-picking and action recognition datasets, while consistently outperforming popular approaches for future frame prediction. Furthermore, we ablate different modules and design choices in MSPred, experimentally validating that combining features of different spatial and temporal granularity leads to superior performance. Code and models to reproduce our experiments are available at https://github.com/AIS-Bonn/MSPred.
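To make the idea of combining spatial and temporal downsampling concrete, the following is a minimal sketch and not the MSPred architecture itself, which uses hierarchical recurrent networks. It only shows how higher levels of a hierarchy might see coarser frames and update less often. The function names, the 2x2 average pooling, and the update periods `(1, 4, 8)` are all assumptions made for illustration.

```python
import numpy as np

def avg_pool2x(frame):
    """Halve spatial resolution by averaging non-overlapping 2x2 blocks."""
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def active_levels(t, periods):
    """Return the indices of the hierarchy levels that update at time step t."""
    return [i for i, p in enumerate(periods) if t % p == 0]

def multiscale_schedule(frames, periods=(1, 4, 8)):
    """Collect, per level, the (time step, downsampled frame) pairs it processes.

    Level 0 sees every full-resolution frame. Level i sees a frame pooled
    i times and only at time steps divisible by periods[i] (hypothetical values).
    """
    outputs = {i: [] for i in range(len(periods))}
    for t, frame in enumerate(frames):
        # Build a spatial pyramid: one extra pooling step per level.
        pyramid = [frame]
        for _ in range(len(periods) - 1):
            pyramid.append(avg_pool2x(pyramid[-1]))
        # Only the levels whose period divides t are updated at this step.
        for i in active_levels(t, periods):
            outputs[i].append((t, pyramid[i]))
    return outputs

frames = [np.full((16, 16), float(t)) for t in range(8)]
schedule = multiscale_schedule(frames)
```

With eight 16x16 input frames, level 0 processes all eight frames at full resolution, level 1 processes two 8x8 frames (steps 0 and 4), and level 2 processes a single 4x4 frame (step 0). A coarser level therefore covers a given time horizon in fewer steps.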

Results

Task	Dataset	Metric	Value	Model
Video Prediction	Moving MNIST	LPIPS	0.024	MSPred
Video Prediction	Moving MNIST	MSE	34.44	MSPred
Video Prediction	Moving MNIST	PSNR	26.82	MSPred
Video Prediction	Moving MNIST	SSIM	0.975	MSPred
Video Prediction	KTH	LPIPS	0.029	MSPred
Video Prediction	KTH	MSE	23.18	MSPred
Video Prediction	KTH	PSNR	27.81	MSPred
Video Prediction	KTH	SSIM	0.951	MSPred
Video Prediction	SynpickVP	LPIPS	0.033	MSPred
Video Prediction	SynpickVP	MSE	53.09	MSPred
Video Prediction	SynpickVP	PSNR	27.89	MSPred
Video Prediction	SynpickVP	SSIM	0.881	MSPred
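
For readers unfamiliar with the pixel-level metrics in the table, here is a minimal sketch of the textbook definitions of MSE and PSNR for a single frame pair. It assumes pixel values in [0, 1] and a per-pixel mean. This page does not state the normalization used for the reported numbers, so the table values may follow a different convention (for example, errors summed per frame) and should not be expected to match this formula. SSIM and LPIPS are omitted because they need a windowed similarity computation and a pretrained network, respectively.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error averaged over all pixels (assumed convention)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; max_val is the assumed pixel range."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_val ** 2 / err))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
frame_mse = mse(pred, target)
frame_psnr = psnr(pred, target)
```

Lower is better for MSE and LPIPS, and higher is better for PSNR and SSIM.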
