Lluis Castrejon, Nicolas Ballas, Aaron Courville
Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder (VAE). While VAEs can handle uncertainty and model multiple possible future outcomes, they tend to produce blurry predictions. In this work, we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher-capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics on three different datasets.
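As a rough illustration of what a hierarchy of latent variables can look like, one common way to write such a model factorizes the prior and the approximate posterior at each time step across levels, and trains with the usual variational lower bound. The notation here is an assumption made for this sketch and does not come from the abstract: $x_t$ is the frame at time $t$, $z_t^{l}$ is the latent variable at level $l$ of $L$ levels, and $\theta$, $\phi$ are the generative and inference parameters.

```latex
\begin{align}
% Prior: each level conditions on the levels above it and on the past
p_\theta(z_t \mid z_{<t}, x_{<t})
  &= \prod_{l=1}^{L} p_\theta\!\left(z_t^{l} \mid z_t^{<l}, z_{<t}, x_{<t}\right) \\
% Approximate posterior: same factorization, but it also sees the current frame
q_\phi(z_t \mid z_{<t}, x_{\le t})
  &= \prod_{l=1}^{L} q_\phi\!\left(z_t^{l} \mid z_t^{<l}, z_{<t}, x_{\le t}\right) \\
% Evidence lower bound summed over time steps
\log p_\theta(x_{1:T})
  &\ge \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{\le t} \mid x_{\le t})}\Big[
     \log p_\theta(x_t \mid z_{\le t}, x_{<t})
     - D_{\mathrm{KL}}\big(q_\phi(z_t \mid z_{<t}, x_{\le t}) \,\big\|\, p_\theta(z_t \mid z_{<t}, x_{<t})\big)
     \Big]
\end{align}
```

Under this reading, adding levels makes both distributions more flexible than a single diagonal Gaussian, which is one way to act on the underfitting argument above.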
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | BAIR Robot Pushing | Cond | 2 | Hier-VRNN |
| Video | BAIR Robot Pushing | FVD score | 143.4 | Hier-VRNN |
| Video | BAIR Robot Pushing | Pred | 28 | Hier-VRNN |
| Video | BAIR Robot Pushing | Train | 10 | Hier-VRNN |
| Video | BAIR Robot Pushing | Cond | 2 | VRNN 1L |
| Video | BAIR Robot Pushing | FVD score | 149.22 | VRNN 1L |
| Video | BAIR Robot Pushing | Pred | 28 | VRNN 1L |
| Video | BAIR Robot Pushing | Train | 10 | VRNN 1L |
| Video | Cityscapes 128x128 | Cond. | 2 | Hier-VRNN |
| Video | Cityscapes 128x128 | FVD | 567.51 | Hier-VRNN |
| Video | Cityscapes 128x128 | Pred | 28 | Hier-VRNN |
| Video | Cityscapes 128x128 | Train | 10 | Hier-VRNN |
| Video Prediction | Cityscapes 128x128 | Cond. | 2 | Hier-VRNN |
| Video Prediction | Cityscapes 128x128 | FVD | 567.51 | Hier-VRNN |
| Video Prediction | Cityscapes 128x128 | Pred | 28 | Hier-VRNN |
| Video Prediction | Cityscapes 128x128 | Train | 10 | Hier-VRNN |
| Video Generation | BAIR Robot Pushing | Cond | 2 | Hier-VRNN |
| Video Generation | BAIR Robot Pushing | FVD score | 143.4 | Hier-VRNN |
| Video Generation | BAIR Robot Pushing | Pred | 28 | Hier-VRNN |
| Video Generation | BAIR Robot Pushing | Train | 10 | Hier-VRNN |
| Video Generation | BAIR Robot Pushing | Cond | 2 | VRNN 1L |
| Video Generation | BAIR Robot Pushing | FVD score | 149.22 | VRNN 1L |
| Video Generation | BAIR Robot Pushing | Pred | 28 | VRNN 1L |
| Video Generation | BAIR Robot Pushing | Train | 10 | VRNN 1L |