TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Photorealistic Video Generation with Diffusion Models

Photorealistic Video Generation with Diffusion Models

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

2023-12-11Super-ResolutionText-to-Video GenerationVideo Super-ResolutionVideo Generation
PaperPDF

Abstract

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second.

Results

TaskDatasetMetricValueModel
VideoUCF-101FVD16258.1W.A.L.T 3B (text-conditional)
VideoUCF-101Inception Score35.1W.A.L.T 3B (text-conditional)
VideoKinetics-600 12 frames, 64x64FVD3.3W.A.L.T.-L
Video PredictionKinetics-600 12 frames, 64x64FVD3.3W.A.L.T.-L
Video GenerationUCF-101FVD16258.1W.A.L.T 3B (text-conditional)
Video GenerationUCF-101Inception Score35.1W.A.L.T 3B (text-conditional)
Text-to-Video GenerationUCF-101FVD16258.1W.A.L.T 3B

Related Papers

SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution2025-07-17LoViC: Efficient Long Video Generation with Context Compression2025-07-17World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving2025-07-17Leveraging Pre-Trained Visual Models for AI-Generated Video Detection2025-07-17Taming Diffusion Transformer for Real-Time Mobile Video Generation2025-07-17IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution2025-07-14PanoDiff-SR: Synthesizing Dental Panoramic Radiographs using Diffusion and Super-resolution2025-07-12$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting2025-07-12