Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen

2022-11-23
Tasks: Denoising, Text-to-Video Generation, Vocal Bursts Intensity Prediction, Image Generation, Video Generation

Abstract

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
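
The abstract names conditional latent perturbation and unconditional guidance but gives no formulas. The following is a minimal sketch of how such steps are commonly written, not the authors' implementation. It assumes a linear beta schedule, the standard DDPM forward-noising rule for perturbing the conditioning latents, and a classifier-free-guidance style blend for the guidance step. All function names, the flattened latent, and the parameter values are hypothetical.

```python
import math
import random


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def perturb_condition(cond_latents, t, alpha_bar, rng):
    """Noise the conditioning latents to step t (standard DDPM forward process).

    Assumption: a small t keeps most of the signal, so the model learns to
    tolerate slightly corrupted conditions such as its own earlier outputs.
    """
    keep = math.sqrt(alpha_bar[t])
    spread = math.sqrt(1.0 - alpha_bar[t])
    return [keep * z + spread * rng.gauss(0.0, 1.0) for z in cond_latents]


def guided_noise(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    Assumption: classifier-free-guidance form; scale=1 returns the
    conditional prediction and scale=0 returns the unconditional one.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


rng = random.Random(0)
alpha_bar = linear_alpha_bar()
cond = [rng.gauss(0.0, 1.0) for _ in range(8)]  # hypothetical flattened latent
noisy_cond = perturb_condition(cond, t=50, alpha_bar=alpha_bar, rng=rng)
```

In this sketch the perturbation strength is controlled only by the step index `t`. How the paper chooses the noise level, and how it weights the guidance term, is not stated in the abstract.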

Results

Task | Dataset | Metric | Value | Model
Video | UCF-101 | FVD16 | 372 | LVDM (256x256, unconditional)
Video | UCF-101 | KVD16 | 27 | LVDM (256x256, unconditional)
Video | UCF-101 | FVD16 | 552 | LVDM (256x256, unconditional)
Video | UCF-101 | KVD16 | 42 | LVDM (256x256, unconditional)
Video | UCF-101 | FVD16 | 1209 | TGAN-v2 (128x128)
Video | UCF-101 | FVD16 | 1396 | VDM
Video | UCF-101 | KVD16 | 116 | VDM
Video | UCF-101 | FVD16 | 2460 | MCVD
Video | UCF-101 | KVD16 | 148 | MCVD
Video | Taichi | FVD16 | 94.6 | TATS (128x128)
Video | Taichi | KVD16 | 9.8 | TATS (128x128)
Video | Taichi | FVD16 | 99 | LVDM (256x256)
Video | Taichi | KVD16 | 15.3 | LVDM (256x256)
Video | Taichi | FVD16 | 128.1 | DIGAN (128x128)
Video | Taichi | KVD16 | 20.6 | DIGAN (128x128)
Video | Taichi | FVD16 | 144.7 | MoCoGAN-HD (128x128)
Video | Taichi | KVD16 | 25.4 | MoCoGAN-HD (128x128)
Video | Taichi | FVD16 | 156.7 | DIGAN (256x256)
Video | Sky Time-lapse | FVD16 | 95.2 | LVDM (256x256)
Video | Sky Time-lapse | KVD16 | 3.9 | LVDM (256x256)
Video | Sky Time-lapse | FVD16 | 107.5 | Long-video GAN (128x128)
Video | Sky Time-lapse | FVD16 | 114.6 | DIGAN (128x128)
Video | Sky Time-lapse | KVD16 | 6.8 | DIGAN (128x128)
Video | Sky Time-lapse | FVD16 | 116.5 | Long-video GAN (256x256)
Video | Sky Time-lapse | FVD16 | 132.6 | TATS (128x128)
Video | Sky Time-lapse | KVD16 | 5.7 | TATS (128x128)
Video | Sky Time-lapse | FVD16 | 183.6 | MoCoGAN-HD (128x128)
Video | Sky Time-lapse | KVD16 | 13.9 | MoCoGAN-HD (128x128)
Video Generation | UCF-101 | FVD16 | 372 | LVDM (256x256, unconditional)
Video Generation | UCF-101 | KVD16 | 27 | LVDM (256x256, unconditional)
Video Generation | UCF-101 | FVD16 | 552 | LVDM (256x256, unconditional)
Video Generation | UCF-101 | KVD16 | 42 | LVDM (256x256, unconditional)
Video Generation | UCF-101 | FVD16 | 1209 | TGAN-v2 (128x128)
Video Generation | UCF-101 | FVD16 | 1396 | VDM
Video Generation | UCF-101 | KVD16 | 116 | VDM
Video Generation | UCF-101 | FVD16 | 2460 | MCVD
Video Generation | UCF-101 | KVD16 | 148 | MCVD
Video Generation | Taichi | FVD16 | 94.6 | TATS (128x128)
Video Generation | Taichi | KVD16 | 9.8 | TATS (128x128)
Video Generation | Taichi | FVD16 | 99 | LVDM (256x256)
Video Generation | Taichi | KVD16 | 15.3 | LVDM (256x256)
Video Generation | Taichi | FVD16 | 128.1 | DIGAN (128x128)
Video Generation | Taichi | KVD16 | 20.6 | DIGAN (128x128)
Video Generation | Taichi | FVD16 | 144.7 | MoCoGAN-HD (128x128)
Video Generation | Taichi | KVD16 | 25.4 | MoCoGAN-HD (128x128)
Video Generation | Taichi | FVD16 | 156.7 | DIGAN (256x256)
Video Generation | Sky Time-lapse | FVD16 | 95.2 | LVDM (256x256)
Video Generation | Sky Time-lapse | KVD16 | 3.9 | LVDM (256x256)
Video Generation | Sky Time-lapse | FVD16 | 107.5 | Long-video GAN (128x128)
Video Generation | Sky Time-lapse | FVD16 | 114.6 | DIGAN (128x128)
Video Generation | Sky Time-lapse | KVD16 | 6.8 | DIGAN (128x128)
Video Generation | Sky Time-lapse | FVD16 | 116.5 | Long-video GAN (256x256)
Video Generation | Sky Time-lapse | FVD16 | 132.6 | TATS (128x128)
Video Generation | Sky Time-lapse | KVD16 | 5.7 | TATS (128x128)
Video Generation | Sky Time-lapse | FVD16 | 183.6 | MoCoGAN-HD (128x128)
Video Generation | Sky Time-lapse | KVD16 | 13.9 | MoCoGAN-HD (128x128)
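
To make the table easier to scan, the sketch below picks the lowest FVD16 entry per dataset from the rows listed above (lower FVD indicates generated videos closer to the real distribution). The values are copied from the table; the `rows` layout and the `best_by_dataset` helper are hypothetical names introduced for illustration. The listed models were evaluated at different resolutions, so this ranking is not a strict like-for-like comparison.

```python
# (dataset, model, FVD16) rows copied from the results table; lower is better.
rows = [
    ("UCF-101", "LVDM (256x256, unconditional)", 372.0),
    ("UCF-101", "LVDM (256x256, unconditional)", 552.0),
    ("UCF-101", "TGAN-v2 (128x128)", 1209.0),
    ("UCF-101", "VDM", 1396.0),
    ("UCF-101", "MCVD", 2460.0),
    ("Taichi", "TATS (128x128)", 94.6),
    ("Taichi", "LVDM (256x256)", 99.0),
    ("Taichi", "DIGAN (128x128)", 128.1),
    ("Taichi", "MoCoGAN-HD (128x128)", 144.7),
    ("Taichi", "DIGAN (256x256)", 156.7),
    ("Sky Time-lapse", "LVDM (256x256)", 95.2),
    ("Sky Time-lapse", "Long-video GAN (128x128)", 107.5),
    ("Sky Time-lapse", "DIGAN (128x128)", 114.6),
    ("Sky Time-lapse", "Long-video GAN (256x256)", 116.5),
    ("Sky Time-lapse", "TATS (128x128)", 132.6),
    ("Sky Time-lapse", "MoCoGAN-HD (128x128)", 183.6),
]


def best_by_dataset(rows):
    """Return {dataset: (model, fvd)} for the lowest-FVD row of each dataset."""
    best = {}
    for dataset, model, fvd in rows:
        if dataset not in best or fvd < best[dataset][1]:
            best[dataset] = (model, fvd)
    return best


best = best_by_dataset(rows)
for dataset, (model, fvd) in best.items():
    print(f"{dataset}: {model} (FVD16 = {fvd})")
```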

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)