Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, Sergey Levine
Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images require the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future. This can lead to low-quality predictions in real-world settings with stochastic dynamics. In this paper, we develop a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world video. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. Our SV2P implementation will be open-sourced upon publication.
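
The idea of predicting a different future for each latent sample can be illustrated with a toy sketch. This is not the SV2P architecture: `predict_future`, `sample_futures`, the two-dimensional latent, and the pixel-shift dynamics are hypothetical stand-ins chosen only to show that a predictor which is deterministic given the latent still yields a distribution over futures once the latent is sampled.

```python
import numpy as np

def predict_future(context, z, num_frames):
    """Toy latent-conditioned predictor (hypothetical, not the SV2P network).

    context: (H, W) array, the last observed frame.
    z: (2,) latent vector, interpreted here as a per-step pixel shift (dy, dx).
    Returns an array of shape (num_frames, H, W).
    """
    shift = tuple(int(s) for s in np.round(z))
    frames = []
    frame = context
    for _ in range(num_frames):
        # Deterministic given z: the same latent always gives the same rollout.
        frame = np.roll(frame, shift=shift, axis=(0, 1))
        frames.append(frame)
    return np.stack(frames)

def sample_futures(context, num_samples, num_frames, rng):
    """Draw one latent per rollout from a Gaussian prior and predict a future for each."""
    zs = rng.standard_normal((num_samples, 2)) * 2.0  # assumed prior scale
    futures = np.stack([predict_future(context, z, num_frames) for z in zs])
    return zs, futures

# Usage: a single bright pixel that moves differently under different latents.
context = np.zeros((8, 8))
context[0, 0] = 1.0
zs, futures = sample_futures(context, num_samples=3, num_frames=4,
                             rng=np.random.default_rng(0))
```

Removing the sampling step and fixing `z` to a constant collapses this to a single predicted future, which is the deterministic assumption the abstract argues against.
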
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | BAIR Robot Pushing | Cond | 2 | SV2P (from FVD) |
| Video | BAIR Robot Pushing | FVD score | 262.5 | SV2P (from FVD) |
| Video | BAIR Robot Pushing | Pred | 14 | SV2P (from FVD) |
| Video | BAIR Robot Pushing | Train | 14 | SV2P (from FVD) |
| Video | BAIR Robot Pushing | Cond | 2 | SV2P (from SRVP) |
| Video | BAIR Robot Pushing | Pred | 28 | SV2P (from SRVP) |
| Video | BAIR Robot Pushing | Train | 12 | SV2P (from SRVP) |
| Video | KTH | Cond | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | FVD | 209.5 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | LPIPS | 0.232 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | PSNR | 25.87 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Params (M) | 8.3 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Pred | 40 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | SSIM | 0.782 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Train | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Cond | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | FVD | 253.5 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | LPIPS | 0.26 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | PSNR | 25.7 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Params (M) | 8.3 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Pred | 40 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | SSIM | 0.772 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Train | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video | KTH | Cond | 10 | SV2P (from SRVP) |
| Video | KTH | Pred | 30 | SV2P (from SRVP) |
| Video | KTH | SSIM | 0.838 | SV2P (from SRVP) |
| Video | KTH | Train | 10 | SV2P (from SRVP) |
| Video Prediction | KTH | Cond | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | FVD | 209.5 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | LPIPS | 0.232 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | PSNR | 25.87 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Params (M) | 8.3 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Pred | 40 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | SSIM | 0.782 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Train | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Cond | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | FVD | 253.5 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | LPIPS | 0.26 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | PSNR | 25.7 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Params (M) | 8.3 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Pred | 40 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | SSIM | 0.772 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Train | 10 | SV2P time-invariant (from Grid-keypoints) |
| Video Prediction | KTH | Cond | 10 | SV2P (from SRVP) |
| Video Prediction | KTH | Pred | 30 | SV2P (from SRVP) |
| Video Prediction | KTH | SSIM | 0.838 | SV2P (from SRVP) |
| Video Prediction | KTH | Train | 10 | SV2P (from SRVP) |
| Video Generation | BAIR Robot Pushing | Cond | 2 | SV2P (from FVD) |
| Video Generation | BAIR Robot Pushing | FVD score | 262.5 | SV2P (from FVD) |
| Video Generation | BAIR Robot Pushing | Pred | 14 | SV2P (from FVD) |
| Video Generation | BAIR Robot Pushing | Train | 14 | SV2P (from FVD) |
| Video Generation | BAIR Robot Pushing | Cond | 2 | SV2P (from SRVP) |
| Video Generation | BAIR Robot Pushing | Pred | 28 | SV2P (from SRVP) |
| Video Generation | BAIR Robot Pushing | Train | 12 | SV2P (from SRVP) |
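
For readers unfamiliar with the PSNR metric reported in the table, the following is a minimal sketch of its standard definition, assuming frames are arrays scaled to the range [0, `max_val`]. The function names are illustrative, and the exact evaluation protocol behind the tabulated numbers (for example, how samples are selected or averaged) is not specified here and is not reproduced by this sketch.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def per_frame_psnr(pred_video, target_video, max_val=1.0):
    """PSNR for each corresponding frame pair of two videos shaped (T, H, W[, C])."""
    return [psnr(p, t, max_val) for p, t in zip(pred_video, target_video)]

# Usage: a prediction that is off by 0.1 everywhere has MSE 0.01.
target = np.zeros((2, 4, 4))
pred = np.full((2, 4, 4), 0.1)
scores = per_frame_psnr(pred, target)
```

Higher PSNR means lower pixel-wise error; it does not measure perceptual quality, which is why the table also lists SSIM, LPIPS, and FVD.
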