Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, Sergey Levine
Being able to predict what may happen in the future requires an in-depth understanding of the physical and causal rules that govern the world. A model that can do so has a number of appealing applications, from robotic planning to representation learning. However, learning to predict raw future observations, such as frames in a video, is exceedingly challenging: the ambiguous nature of the problem can cause a naively designed model to average together possible futures into a single, blurry prediction. Recently, this has been addressed by two distinct approaches: (a) variational latent variable models that explicitly model the underlying stochasticity and (b) adversarially trained models that aim to produce naturalistic images. However, a standard latent variable model can struggle to produce realistic results, and a standard adversarially trained model underutilizes latent variables and fails to produce diverse predictions. We show that these distinct methods are in fact complementary. Combining the two produces predictions that look more realistic to human raters and better cover the range of possible futures. Our method outperforms prior and concurrent work in these aspects.
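The blurry-averaging failure mode described in the abstract can be reproduced in a toy setting. The sketch below is illustrative only and assumes a squared-error loss, a 1-D "frame" of five pixels, and two equally likely futures (an object ends up at pixel 1 or at pixel 3). None of these details come from the paper. The per-pixel mean of the futures minimizes expected squared error, yet it matches neither future.

```python
# Toy illustration (assumed setup, not from the paper): two equally likely
# futures for a 1-D "frame" of 5 pixels; the object ends up at pixel 1 or 3.
futures = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]

def expected_mse(pred, outcomes):
    """Average per-pixel squared error of `pred` over equally likely outcomes."""
    total = 0.0
    for y in outcomes:
        total += sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
    return total / len(outcomes)

# Under squared error, the best single deterministic prediction is the
# per-pixel mean of the possible futures.
mean_pred = [sum(col) / len(futures) for col in zip(*futures)]

print(mean_pred)                          # [0.0, 0.5, 0.0, 0.5, 0.0]
print(expected_mse(mean_pred, futures))   # 0.1
print(expected_mse(futures[0], futures))  # 0.2
```

The mean scores better under squared error (0.1 versus 0.2 for committing to one future), but it places half-intensity "ghosts" at both positions. This is the 1-D analogue of a blurry predicted frame. A latent variable model avoids the problem by letting a sampled latent select one plausible future instead of averaging over all of them.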
| Task | Dataset | Model | Cond frames | Pred frames | Train frames | FVD | PSNR | SSIM | LPIPS | Params (M) |
|---|---|---|---|---|---|---|---|---|---|---|
| Video, Video Generation | BAIR Robot Pushing | SAVP (from FVD) | 2 | 14 | 14 | 116.4 | - | - | - | - |
| Video, Video Generation | BAIR Robot Pushing | SAVP (from vRNN) | 2 | 28 | 10 | 143.43 | - | - | - | - |
| Video, Video Generation | BAIR Robot Pushing | SAVP (from SRVP) | 2 | 28 | 12 | - | - | - | - | - |
| Video, Video Generation | BAIR Robot Pushing | SAVP-VAE (from WAM) | 2 | 28 | 14 | - | 19.09 | 0.815 | - | - |
| Video, Video Prediction | KTH | SAVP-VAE (from Grid-keypoints) | 10 | 40 | 10 | 145.7 | 26 | 0.806 | 0.116 | 7.3 |
| Video, Video Prediction | KTH | SAVP (from Grid-keypoints) | 10 | 40 | 10 | 183.7 | 23.79 | 0.699 | 0.126 | 17.6 |
| Video, Video Prediction | KTH | SAVP (from SRVP) | 10 | 30 | 10 | - | - | - | - | - |
| Video, Video Prediction | KTH | SAVP-VAE | 10 | 20 | - | - | 27.77 | 0.852 | - | - |
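The abstract's central idea is to combine a variational latent variable objective with an adversarial one. The sketch below is a heavily simplified, hypothetical version of such a combined generator loss and is not the paper's exact objective. The function names, the weights `beta` and `lam`, the L1 reconstruction term, and the non-saturating `-log D` adversarial term are all assumptions. A real model would compute these terms over image tensors with an autodiff framework, not over Python lists. Setting `use_adversarial=False` leaves a VAE-only loss. That is presumably what separates the SAVP-VAE entries in the table from the full SAVP model.

```python
import math

# Hypothetical sketch of a VAE + GAN generator objective (assumed form,
# not the exact SAVP loss). Inputs are plain lists/floats for illustration.

def l1_loss(pred, target):
    """Mean absolute error between predicted and ground-truth pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def generator_adv_loss(d_prob_fake):
    """Non-saturating GAN loss -log D(G(z)), where D outputs P(input is real)."""
    return -math.log(d_prob_fake)

def combined_loss(pred, target, mu, logvar, d_prob_fake,
                  beta=1.0, lam=1.0, use_adversarial=True):
    """Weighted sum of reconstruction, KL, and (optionally) adversarial terms."""
    loss = l1_loss(pred, target) + beta * kl_to_standard_normal(mu, logvar)
    if use_adversarial:
        loss += lam * generator_adv_loss(d_prob_fake)
    return loss

# Example with made-up values: a 2-pixel prediction, a 2-D latent whose
# posterior equals the prior, and a discriminator output of 0.5.
full = combined_loss([0.2, 0.8], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0], 0.5)
vae_only = combined_loss([0.2, 0.8], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0], 0.5,
                         use_adversarial=False)
print(full - vae_only)  # roughly 0.693, i.e. log(2)
```

In this form, the KL and reconstruction terms encourage the latent variable to carry the information that distinguishes one future from another. The adversarial term pushes each sampled prediction toward looking like a real frame rather than a blurry average.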