Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

2023-09-26 · Super-Resolution · Text-to-Video Generation · Video Super-Resolution · Video Generation

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
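The first insight in the abstract, temporal self-attention combined with rotary positional encoding, can be illustrated with a toy sketch. This is not the authors' implementation: it uses identity query/key/value projections, a single head, and plain Python lists, and the helper names (`rope`, `temporal_self_attention`) and the `base=10000.0` frequency constant are assumptions made for illustration only. It attends over the frame axis for one spatial location and shows that rotary encoding makes attention scores depend on the relative frame offset rather than on absolute frame indices.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`.

    `base` is an assumed frequency constant, not a value taken from the paper.
    """
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per feature pair
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """Attend across time for one spatial location.

    `frames` is a list of T feature vectors, one per frame. Projections are
    the identity here (a simplification); a real model would learn them.
    """
    d = len(frames[0])
    q = [rope(f, t) for t, f in enumerate(frames)]  # position = frame index
    k = q  # identity projections, so rotated queries and keys coincide
    out = []
    for qt in q:
        weights = softmax([dot(qt, kt) / math.sqrt(d) for kt in k])
        # Values are left unrotated; each output mixes the frame features.
        out.append([sum(w * f[j] for w, f in zip(weights, frames))
                    for j in range(d)])
    return out


# Three frames with four features each (made-up numbers).
frames = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, 0.25, 0.75], [0.3, -0.2, 1.0, 0.1]]
mixed = temporal_self_attention(frames)

# Relative-position property: the score depends only on the frame offset.
q_vec, k_vec = [0.2, -1.0, 0.7, 0.4], [1.5, 0.3, -0.6, 0.9]
score_a = dot(rope(q_vec, 5), rope(k_vec, 2))    # offset of 3 frames
score_b = dot(rope(q_vec, 13), rope(k_vec, 10))  # also an offset of 3 frames
```

Here `score_a` and `score_b` agree up to floating-point error, which is the property that lets a temporal attention layer reason about how far apart two frames are regardless of where they sit in the clip.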

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | UCF-101 | FVD16 | 526.3 | LAVIE (320x512, text-conditional) |
| Video Generation | UCF-101 | FVD16 | 526.3 | LAVIE (320x512, text-conditional) |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Motion Quality | 57.99 | Lavie |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Temporal Consistency | 54.23 | Lavie |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Text-to-Video Alignment | 68.49 | Lavie |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Total Score | 234 | Lavie |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Visual Quality | 52.83 | Lavie |
| Text-to-Video Generation | UCF-101 | FVD16 | 526.3 | LAVIE (Zero-shot, 320x512) |

Related Papers

SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution (2025-07-17)
LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution (2025-07-14)
PanoDiff-SR: Synthesizing Dental Panoramic Radiographs using Diffusion and Super-resolution (2025-07-12)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)