
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis

Published 2023-04-18 · CVPR 2023
Tasks: Super-Resolution · Text-to-Video Generation · Video Super-Resolution · Image Generation · Video Generation
Links: Paper · PDF · Code (official and community implementations)

Abstract

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super-resolution models. We focus on two relevant real-world applications: simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/
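
The core recipe above (pre-train spatial layers on images, freeze them, then interleave and train only temporal layers that mix information across frames) can be sketched compactly. Below is a minimal, illustrative PyTorch sketch, not the authors' implementation: TemporalAttention and VideoBlock are hypothetical names, the residual mixing is simplified, and conditioning on diffusion timesteps and text prompts is omitted.

```python
# Illustrative sketch only (hypothetical module names, simplified residuals).
# Idea from the abstract: frozen spatial layers treat each frame as an
# independent image; a trainable temporal layer then mixes across frames.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis, applied at every spatial location."""
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        # channels must be divisible by heads
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width), as produced by
        # spatial layers that see each frame independently.
        bf, c, h, w = x.shape
        b = bf // num_frames
        # Fold space into the batch and expose time as the sequence axis.
        t = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 1, 2)  # (b, h, w, T, c)
        t = t.reshape(b * h * w, num_frames, c)
        n = self.norm(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]  # residual temporal mixing
        # Restore the original (batch * frames, c, h, w) layout.
        t = t.view(b, h, w, num_frames, c).permute(0, 3, 4, 1, 2)
        return t.reshape(bf, c, h, w)

class VideoBlock(nn.Module):
    """A frozen spatial block from an image LDM followed by a trainable temporal layer."""
    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block
        self.spatial.requires_grad_(False)            # image weights stay fixed
        self.temporal = TemporalAttention(channels)   # only this part is trained

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        return self.temporal(self.spatial(x), num_frames)

# Usage: two videos of eight 32x32 latent frames, 64 channels each.
block = VideoBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
y = block(torch.randn(2 * 8, 64, 32, 32), num_frames=8)  # same shape out
```

Because gradients flow only through the temporal layers, the same trained temporal layers can be dropped into a differently fine-tuned spatial backbone, which is the property the abstract exploits for personalized text-to-video generation.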

Results

Task | Dataset | Metric | Value | Model
Video Generation | UCF-101 | FVD16 | 550.61 | Video LDM (320x512, text-conditional)
Video Generation | UCF-101 | Inception Score | 33.45 | Video LDM (320x512, text-conditional)
Text-to-Video Generation | UCF-101 | FVD16 | 550.61 | Video LDM (Zero-shot, 320x512)
Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.2929 | Video LDM
Text-to-Video Generation | MSR-VTT | CLIP-FID | 24.78 | CogVideo (Chinese)
Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.2614 | CogVideo (Chinese)
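
For reference, FVD16 on Papers With Code denotes Fréchet Video Distance computed on 16-frame clips. Like FID, FVD is the Fréchet distance between Gaussians fitted to features of real and generated samples (I3D video features in FVD's case); lower is better:

```latex
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

where (mu_r, Sigma_r) and (mu_g, Sigma_g) are the feature mean and covariance of real and generated clips. CLIPSIM is the average CLIP cosine similarity between the text prompt and each generated frame; higher is better. A minimal sketch, assuming the openai/CLIP package and a ViT-B/32 backbone (reported numbers may use a different CLIP variant or frame-sampling protocol):

```python
# Hedged sketch of CLIPSIM: mean CLIP cosine similarity between the prompt
# and each generated frame. Assumes the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git); the ViT-B/32
# backbone is an assumption, not necessarily what the papers above used.
import clip
import torch
from PIL import Image

@torch.no_grad()
def clipsim(frames: list[Image.Image], prompt: str, device: str = "cpu") -> float:
    model, preprocess = clip.load("ViT-B/32", device=device)  # loaded per call for brevity
    images = torch.stack([preprocess(f) for f in frames]).to(device)  # (T, 3, 224, 224)
    img_emb = model.encode_image(images)
    txt_emb = model.encode_text(clip.tokenize([prompt]).to(device))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    # Cosine similarity of every frame against the single prompt, averaged.
    return (img_emb @ txt_emb.T).mean().item()
```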

Related Papers

SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution (2025-07-17)
LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)