TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/VideoCrafter1: Open Diffusion Models for High-Quality Vide...

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan

2023-10-30Text-to-Video GenerationVideo Generation
PaperPDFCodeCode(official)Code

Abstract

Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.

Results

TaskDatasetMetricValueModel
Text-to-Video GenerationEvalCrafter Text-to-Video (ECTV) DatasetMotion Quality60.85VideoCrafter1
Text-to-Video GenerationEvalCrafter Text-to-Video (ECTV) DatasetTemporal Consistency55.89VideoCrafter1
Text-to-Video GenerationEvalCrafter Text-to-Video (ECTV) DatasetText-to-Video Alignment61.95VideoCrafter1
Text-to-Video GenerationEvalCrafter Text-to-Video (ECTV) DatasetTotal Score232VideoCrafter1
Text-to-Video GenerationEvalCrafter Text-to-Video (ECTV) DatasetVisual Quality53.08VideoCrafter1

Related Papers

LoViC: Efficient Long Video Generation with Context Compression2025-07-17World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving2025-07-17Leveraging Pre-Trained Visual Models for AI-Generated Video Detection2025-07-17Taming Diffusion Transformer for Real-Time Mobile Video Generation2025-07-17$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting2025-07-12Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective2025-07-11Scaling RL to Long Videos2025-07-10Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions2025-07-10