Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, YuChao Gu, Difei Gao, Mike Zheng Shou

Published: 2023-09-27
Tasks: Text-to-Video Generation · Video Alignment · Video Generation
Links: Paper · PDF · Code (official)

Abstract

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. We then propose a novel expert translation method that employs latent-based VDMs to upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs. 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
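The two-stage design in the abstract (pixel-based VDM for an aligned low-resolution video, then a latent-based upsampler) can be illustrated at the shape level with a toy sketch. Everything here is a hypothetical stand-in — `pixel_stage_lowres` and `latent_stage_superres` are placeholder functions, not the actual Show-1 models — showing only how data flows between the stages:

```python
import numpy as np

def pixel_stage_lowres(prompt: str, frames: int = 8, size: int = 64) -> np.ndarray:
    """Hypothetical stand-in for the pixel-based VDM stage: in Show-1 this
    produces a low-resolution video with strong text-video alignment.
    Here we just return a random (frames, H, W, 3) tensor seeded by the prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((frames, size, size, 3), dtype=np.float32)

def latent_stage_superres(video: np.ndarray, scale: int = 4) -> np.ndarray:
    """Hypothetical stand-in for the latent-based expert-translation upsampler:
    nearest-neighbour upsampling substitutes for latent diffusion super-resolution."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

lowres = pixel_stage_lowres("a panda eating bamboo")
highres = latent_stage_superres(lowres)
print(lowres.shape, highres.shape)  # (8, 64, 64, 3) (8, 256, 256, 3)
```

The point of the split is visible in the shapes: the expensive pixel-space diffusion only ever touches the small 64×64 video, while the cheaper latent-space stage handles the 4× spatial upsampling — which is where the reported memory saving (15G vs. 72G) comes from.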

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Motion Quality | 52.19 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Temporal Consistency | 60.83 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Text-to-Video Alignment | 62.07 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Total Score | 229 | Show-1 |
| Text-to-Video Generation | EvalCrafter Text-to-Video (ECTV) Dataset | Visual Quality | 53.74 | Show-1 |
| Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.3072 | Show-1 |
| Text-to-Video Generation | MSR-VTT | FID | 13.08 | Show-1 |
| Text-to-Video Generation | MSR-VTT | FVD | 538 | Show-1 |
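The CLIPSIM metric in the MSR-VTT rows is conventionally computed as the cosine similarity between the CLIP embedding of the text prompt and the CLIP embedding of each generated frame, averaged over frames. A minimal sketch of that averaging step, with random vectors standing in for real CLIP features (the embeddings here are placeholders, not outputs of an actual CLIP model):

```python
import numpy as np

def clipsim(text_emb: np.ndarray, frame_embs: np.ndarray) -> float:
    """Mean cosine similarity between one text embedding (D,) and
    per-frame image embeddings (T, D)."""
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float((f @ t).mean())

rng = np.random.default_rng(0)
text = rng.standard_normal(512)          # placeholder CLIP text feature
frames = rng.standard_normal((16, 512))  # placeholder per-frame CLIP features
print(clipsim(text, frames))
```

Because both sides are unit-normalized, the score lies in [-1, 1]; the 0.3072 reported above is in the typical range for CLIP text-image similarity, where even well-aligned pairs score well below 1.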

Related Papers

- LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
- $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)
- Scaling RL to Long Videos (2025-07-10)
- Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions (2025-07-10)