Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, YuChao Gu, Difei Gao, Mike Zheng Shou

Published: 2023-09-27
Tasks: Text-to-Video Generation · Video Alignment · Video Generation
Links: Paper · PDF · Code (official)

Abstract

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. We then propose a novel expert translation method that employs latent-based VDMs to upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs. 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
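The two-stage design in the abstract (pixel-based VDM for an aligned low-resolution video, then a latent-based upsampler) can be illustrated at the shape level with a toy sketch. Everything here is a hypothetical stand-in — `pixel_stage_lowres` and `latent_stage_superres` are placeholder functions, not the actual Show-1 models — showing only how data flows between the stages:

```python
import numpy as np

def pixel_stage_lowres(prompt: str, frames: int = 8, size: int = 64) -> np.ndarray:
    """Hypothetical stand-in for the pixel-based VDM stage: in Show-1 this
    produces a low-resolution video with strong text-video alignment.
    Here we just return a random (frames, H, W, 3) tensor seeded by the prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((frames, size, size, 3), dtype=np.float32)

def latent_stage_superres(video: np.ndarray, scale: int = 4) -> np.ndarray:
    """Hypothetical stand-in for the latent-based expert-translation upsampler:
    nearest-neighbour upsampling substitutes for latent diffusion super-resolution."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

lowres = pixel_stage_lowres("a panda eating bamboo")
highres = latent_stage_superres(lowres)
print(lowres.shape, highres.shape)  # (8, 64, 64, 3) (8, 256, 256, 3)
```

The point of the split is visible in the shapes: the expensive pixel-space diffusion only ever touches the small 64×64 video, while the cheaper latent-space stage handles the 4× spatial upsampling — which is where the reported memory saving (15G vs. 72G) comes from.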

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Motion Quality | 52.19 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Temporal Consistency | 60.83 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Text-to-Video Alignment | 62.07 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Total Score | 229 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Visual Quality | 53.74 | Show-1 |
| Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.3072 | Show-1 |
| Text-to-Video Generation | MSR-VTT | FID | 13.08 | Show-1 |
| Text-to-Video Generation | MSR-VTT | FVD | 538 | Show-1 |
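The CLIPSIM metric in the MSR-VTT rows is conventionally computed as the cosine similarity between the CLIP embedding of the text prompt and the CLIP embedding of each generated frame, averaged over frames. A minimal sketch of that averaging step, with random vectors standing in for real CLIP features (the embeddings here are placeholders, not outputs of an actual CLIP model):

```python
import numpy as np

def clipsim(text_emb: np.ndarray, frame_embs: np.ndarray) -> float:
    """Mean cosine similarity between one text embedding (D,) and
    per-frame image embeddings (T, D)."""
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float((f @ t).mean())

rng = np.random.default_rng(0)
text = rng.standard_normal(512)          # placeholder CLIP text feature
frames = rng.standard_normal((16, 512))  # placeholder per-frame CLIP features
print(clipsim(text, frames))
```

Because both sides are unit-normalized, the score lies in [-1, 1]; the 0.3072 reported above is in the typical range for CLIP text-image similarity, where even well-aligned pairs score well below 1.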

Related Papers

- LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
- $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)
- Scaling RL to Long Videos (2025-07-10)
- Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions (2025-07-10)