Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji

2023-05-17 · ICCV 2023 · Text-to-Video Generation · Image Generation · Video Generation

Paper · PDF

Abstract

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.
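The abstract describes replacing the independent per-frame image noise prior with a correlated video noise prior, but does not spell out the construction here. As a rough illustration of the idea only, here is a hedged NumPy sketch of a simple "mixed" noise prior in which every frame's noise blends a shared base sample with per-frame independent noise; the `alpha` mixing weight and the unit-variance normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mixed_video_noise(num_frames, frame_shape, alpha=1.0, rng=None):
    """Sample correlated per-frame Gaussian noise for video diffusion.

    Each frame's noise is a weighted mix of one shared base sample and
    an independent per-frame sample, rescaled so every frame keeps unit
    variance. ``alpha`` (an assumed knob, not the paper's parameter)
    controls how strongly frames are correlated: alpha=0 recovers the
    naive independent image prior.
    """
    rng = np.random.default_rng() if rng is None else rng
    shared = rng.standard_normal(frame_shape)  # one sample reused by all frames
    frames = []
    for _ in range(num_frames):
        independent = rng.standard_normal(frame_shape)
        # alpha*shared + independent has variance alpha^2 + 1, so divide
        # by sqrt(1 + alpha^2) to restore a unit-variance prior per frame.
        frames.append((alpha * shared + independent) / np.sqrt(1.0 + alpha**2))
    return np.stack(frames)
```

With `alpha=1.0`, any two frames share roughly half their noise variance, so the prior preserves temporal correlation while each frame individually still matches the standard Gaussian assumed by a pretrained image diffusion model.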

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | UCF-101 | FVD16 | 310 | PYoCo (Zero-shot, 64x64, unconditional) |
| Video | UCF-101 | Inception Score | 60.01 | PYoCo (Zero-shot, 64x64, unconditional) |
| Video | UCF-101 | FVD16 | 355.19 | PYoCo (Zero-shot, 64x64, text-conditional) |
| Video | UCF-101 | Inception Score | 47.76 | PYoCo (Zero-shot, 64x64, text-conditional) |
| Video Generation | UCF-101 | FVD16 | 310 | PYoCo (Zero-shot, 64x64, unconditional) |
| Video Generation | UCF-101 | Inception Score | 60.01 | PYoCo (Zero-shot, 64x64, unconditional) |
| Video Generation | UCF-101 | FVD16 | 355.19 | PYoCo (Zero-shot, 64x64, text-conditional) |
| Video Generation | UCF-101 | Inception Score | 47.76 | PYoCo (Zero-shot, 64x64, text-conditional) |
| Text-to-Video Generation | UCF-101 | FVD16 | 355.19 | PYoCo (Zero-shot, 64x64) |

Related Papers

- LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
- A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints (2025-07-17)
- Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)