Emily Denton, Rob Fergus
Generating video frames that accurately predict future world states is challenging. Existing approaches either fail to capture the full distribution of outcomes, or yield blurry generations, or both. In this paper we introduce an unsupervised video generation model that learns a prior model of uncertainty in a given environment. Video frames are generated by drawing samples from this prior and combining them with a deterministic estimate of the future frame. The approach is simple and easily trained end-to-end on a variety of datasets. Sample generations are both varied and sharp, even many frames into the future, and compare favorably to those from existing approaches.
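
The generation loop described in the abstract (sample a latent from a learned prior, then combine it with a deterministic estimate of the next frame) can be sketched as follows. This is a toy illustration, not the authors' implementation: `learned_prior`, `sample_latent`, `predict_frame`, and `generate` are hypothetical names, frames are flat lists of pixel intensities in [0, 1], and the hand-written functions stand in for networks that would be trained end-to-end.

```python
import math
import random


def learned_prior(prev_frame):
    """Hypothetical stand-in for a learned prior: maps the previous frame to
    the mean and log-variance of a Gaussian over a scalar latent z."""
    mean = sum(prev_frame) / len(prev_frame)
    log_var = -2.0  # assumption: fixed here; a trained prior would predict it
    return mean, log_var


def sample_latent(mean, log_var, rng):
    """Draw z ~ N(mean, exp(log_var)) as z = mean + sigma * eps."""
    sigma = math.exp(0.5 * log_var)
    return mean + sigma * rng.gauss(0.0, 1.0)


def predict_frame(prev_frame, z):
    """Hypothetical deterministic predictor: the same (prev_frame, z) always
    yields the same frame. Pixels rotate by one position and are offset by z."""
    shifted = prev_frame[-1:] + prev_frame[:-1]
    return [min(1.0, max(0.0, p + 0.1 * z)) for p in shifted]


def generate(context_frame, n_future, seed):
    """Roll out n_future frames, drawing a fresh latent at every time step."""
    rng = random.Random(seed)
    frames, prev = [], context_frame
    for _ in range(n_future):
        mean, log_var = learned_prior(prev)
        z = sample_latent(mean, log_var, rng)
        prev = predict_frame(prev, z)
        frames.append(prev)
    return frames


context = [0.0, 0.2, 0.8, 0.2]
rollout_a = generate(context, n_future=5, seed=0)
rollout_b = generate(context, n_future=5, seed=1)
```

All randomness enters through `z`, so two rollouts from the same context differ only because their latent samples differ. Fixing the seed reproduces a rollout exactly.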

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | BAIR Robot Pushing | Cond | 2 | SVG (from SRVP) |
| Video | BAIR Robot Pushing | Pred | 28 | SVG (from SRVP) |
| Video | BAIR Robot Pushing | Train | 12 | SVG (from SRVP) |
| Video | BAIR Robot Pushing | Cond | 2 | SVG-LP (from vRNN) |
| Video | BAIR Robot Pushing | FVD | 256.62 | SVG-LP (from vRNN) |
| Video | BAIR Robot Pushing | Pred | 28 | SVG-LP (from vRNN) |
| Video | BAIR Robot Pushing | Train | 10 | SVG-LP (from vRNN) |
| Video | BAIR Robot Pushing | Cond | 2 | SVG-FP (from FVD) |
| Video | BAIR Robot Pushing | FVD | 315.5 | SVG-FP (from FVD) |
| Video | BAIR Robot Pushing | Pred | 14 | SVG-FP (from FVD) |
| Video | BAIR Robot Pushing | Train | 14 | SVG-FP (from FVD) |
| Video | KTH | Cond | 10 | SVG-LP (from Grid-keypoints) |
| Video | KTH | FVD | 157.9 | SVG-LP (from Grid-keypoints) |
| Video | KTH | LPIPS | 0.129 | SVG-LP (from Grid-keypoints) |
| Video | KTH | PSNR | 23.91 | SVG-LP (from Grid-keypoints) |
| Video | KTH | Params (M) | 22.8 | SVG-LP (from Grid-keypoints) |
| Video | KTH | Pred | 40 | SVG-LP (from Grid-keypoints) |
| Video | KTH | SSIM | 0.8 | SVG-LP (from Grid-keypoints) |
| Video | KTH | Train | 10 | SVG-LP (from Grid-keypoints) |
| Video | KTH | Cond | 10 | SVG-LP (from SRVP) |
| Video | KTH | Pred | 30 | SVG-LP (from SRVP) |
| Video | KTH | Train | 10 | SVG-LP (from SRVP) |
| Video | SynpickVP | LPIPS | 0.066 | SVG-LP |
| Video | SynpickVP | MSE | 51.82 | SVG-LP |
| Video | SynpickVP | SSIM | 0.886 | SVG-LP |
| Video | SynpickVP | LPIPS | 0.068 | SVG-Det |
| Video | SynpickVP | MSE | 60.6 | SVG-Det |
| Video | SynpickVP | PSNR | 26.92 | SVG-Det |
| Video | SynpickVP | SSIM | 0.879 | SVG-Det |
| Video | Cityscapes 128x128 | Cond | 2 | SVG (from Hier-VRNN) |
| Video | Cityscapes 128x128 | FVD | 1300.26 | SVG (from Hier-VRNN) |
| Video | Cityscapes 128x128 | Pred | 28 | SVG (from Hier-VRNN) |
| Video | Cityscapes 128x128 | Train | 10 | SVG (from Hier-VRNN) |
| Video Prediction | KTH | Cond | 10 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | FVD | 157.9 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | LPIPS | 0.129 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | PSNR | 23.91 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | Params (M) | 22.8 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | Pred | 40 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | SSIM | 0.8 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | Train | 10 | SVG-LP (from Grid-keypoints) |
| Video Prediction | KTH | Cond | 10 | SVG-LP (from SRVP) |
| Video Prediction | KTH | Pred | 30 | SVG-LP (from SRVP) |
| Video Prediction | KTH | Train | 10 | SVG-LP (from SRVP) |
| Video Prediction | SynpickVP | LPIPS | 0.066 | SVG-LP |
| Video Prediction | SynpickVP | MSE | 51.82 | SVG-LP |
| Video Prediction | SynpickVP | SSIM | 0.886 | SVG-LP |
| Video Prediction | SynpickVP | LPIPS | 0.068 | SVG-Det |
| Video Prediction | SynpickVP | MSE | 60.6 | SVG-Det |
| Video Prediction | SynpickVP | PSNR | 26.92 | SVG-Det |
| Video Prediction | SynpickVP | SSIM | 0.879 | SVG-Det |
| Video Prediction | Cityscapes 128x128 | Cond | 2 | SVG (from Hier-VRNN) |
| Video Prediction | Cityscapes 128x128 | FVD | 1300.26 | SVG (from Hier-VRNN) |
| Video Prediction | Cityscapes 128x128 | Pred | 28 | SVG (from Hier-VRNN) |
| Video Prediction | Cityscapes 128x128 | Train | 10 | SVG (from Hier-VRNN) |
| Video Generation | BAIR Robot Pushing | Cond | 2 | SVG (from SRVP) |
| Video Generation | BAIR Robot Pushing | Pred | 28 | SVG (from SRVP) |
| Video Generation | BAIR Robot Pushing | Train | 12 | SVG (from SRVP) |
| Video Generation | BAIR Robot Pushing | Cond | 2 | SVG-LP (from vRNN) |
| Video Generation | BAIR Robot Pushing | FVD | 256.62 | SVG-LP (from vRNN) |
| Video Generation | BAIR Robot Pushing | Pred | 28 | SVG-LP (from vRNN) |
| Video Generation | BAIR Robot Pushing | Train | 10 | SVG-LP (from vRNN) |
| Video Generation | BAIR Robot Pushing | Cond | 2 | SVG-FP (from FVD) |
| Video Generation | BAIR Robot Pushing | FVD | 315.5 | SVG-FP (from FVD) |
| Video Generation | BAIR Robot Pushing | Pred | 14 | SVG-FP (from FVD) |
| Video Generation | BAIR Robot Pushing | Train | 14 | SVG-FP (from FVD) |