Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high-definition text-to-video model, including design decisions such as the choice of fully convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high-quality sampling. We find that Imagen Video is not only capable of generating high-fidelity videos, but also has a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.
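
The abstract names the v-parameterization without defining it. As a minimal sketch (not the authors' implementation), the snippet below uses the standard definition for a variance-preserving diffusion process, where the noisy latent is `z = alpha * x + sigma * eps` and the network target is `v = alpha * eps - sigma * x`. The cosine schedule and all function names are assumptions chosen for illustration.

```python
import math

def alpha_sigma(t):
    """Assumed cosine schedule for t in [0, 1]; satisfies alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def diffuse(x, eps, t):
    """Forward process: mix clean data x with Gaussian noise eps."""
    alpha, sigma = alpha_sigma(t)
    return alpha * x + sigma * eps

def v_target(x, eps, t):
    """Regression target under the v-parameterization."""
    alpha, sigma = alpha_sigma(t)
    return alpha * eps - sigma * x

def x_from_v(z, v, t):
    """Recover the clean-data estimate from a v prediction."""
    alpha, sigma = alpha_sigma(t)
    return alpha * z - sigma * v

def eps_from_v(z, v, t):
    """Recover the noise estimate from a v prediction."""
    alpha, sigma = alpha_sigma(t)
    return sigma * z + alpha * v

# Round trip with scalar stand-ins for a pixel value and a noise sample.
x, eps, t = 0.8, -0.3, 0.4
z = diffuse(x, eps, t)
v = v_target(x, eps, t)
x_rec, eps_rec = x_from_v(z, v, t), eps_from_v(z, v, t)
```

Because `alpha^2 + sigma^2 = 1`, both recoveries are exact when `v` is exact, so a model that predicts `v` implicitly predicts both `x` and `eps` at every noise level.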
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | LAION-400M | CLIP | 25.19 | Imagen original (constant=6) |
| Video | LAION-400M | CLIP R-Precision | 92.12 | Imagen original (constant=6) |
| Video | LAION-400M | CLIP R-Precision | 90.97 | Imagen fully distilled (oscillate(15,1)) |
| Video | LAION-400M | CLIP | 25.29 | Imagen distilled (constant=6) |
| Video | LAION-400M | CLIP R-Precision | 90.88 | Imagen distilled (constant=6) |
| Video | LAION-400M | CLIP | 25.03 | Imagen original (oscillate(15,1)) |
| Video | LAION-400M | CLIP R-Precision | 89.91 | Imagen original (oscillate(15,1)) |
| Video | LAION-400M | CLIP R-Precision | 89.68 | Imagen fully distilled (constant=6) |
| Video | LAION-400M | CLIP | 25.12 | Imagen distilled (oscillate(15,1)) |
| Video | LAION-400M | CLIP R-Precision | 88.78 | Imagen distilled (oscillate(15,1)) |
| Video Generation | LAION-400M | CLIP | 25.19 | Imagen original (constant=6) |
| Video Generation | LAION-400M | CLIP R-Precision | 92.12 | Imagen original (constant=6) |
| Video Generation | LAION-400M | CLIP R-Precision | 90.97 | Imagen fully distilled (oscillate(15,1)) |
| Video Generation | LAION-400M | CLIP | 25.29 | Imagen distilled (constant=6) |
| Video Generation | LAION-400M | CLIP R-Precision | 90.88 | Imagen distilled (constant=6) |
| Video Generation | LAION-400M | CLIP | 25.03 | Imagen original (oscillate(15,1)) |
| Video Generation | LAION-400M | CLIP R-Precision | 89.91 | Imagen original (oscillate(15,1)) |
| Video Generation | LAION-400M | CLIP R-Precision | 89.68 | Imagen fully distilled (constant=6) |
| Video Generation | LAION-400M | CLIP | 25.12 | Imagen distilled (oscillate(15,1)) |
| Video Generation | LAION-400M | CLIP R-Precision | 88.78 | Imagen distilled (oscillate(15,1)) |
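
The model labels in the table are not explained on this page. The sketch below shows one plausible reading, stated as an assumption: `constant=6` is a fixed classifier-free guidance weight of 6 at every sampling step, and `oscillate(15,1)` alternates between a weight of 15 and a weight of 1 across steps. The helper names, the optional `warmup` argument, and the guidance convention (a weight of 1 returns the conditional prediction unchanged) are all hypothetical and are not taken from the source.

```python
def constant_weights(num_steps, w=6.0):
    """Assumed reading of `constant=6`: the same weight at every step."""
    return [w] * num_steps

def oscillating_weights(num_steps, high=15.0, low=1.0, warmup=0):
    """Assumed reading of `oscillate(15,1)`: hold `high` for `warmup`
    steps (hypothetical parameter), then alternate high, low, high, ..."""
    weights = []
    for i in range(num_steps):
        if i < warmup or (i - warmup) % 2 == 0:
            weights.append(high)
        else:
            weights.append(low)
    return weights

def guide(cond, uncond, w):
    """Classifier-free guidance under the assumed convention that
    w = 1 yields the conditional prediction and w > 1 extrapolates."""
    return uncond + w * (cond - uncond)

schedule = oscillating_weights(6, warmup=2)
guided = [guide(cond=1.0, uncond=0.5, w=w) for w in schedule]
```

Under this reading, the alternating schedule applies strong guidance on only some steps, whereas the constant schedule applies a moderate weight throughout.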