Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen

2022-11-23

Tasks: Denoising, Text-to-Video Generation, Vocal Bursts Intensity Prediction, Image Generation, Video Generation

Abstract

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
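
The abstract names unconditional guidance without giving its form. As a rough illustration only, and not the authors' implementation, the sketch below shows the common classifier-free-guidance-style blend of a conditional and an unconditional noise prediction. The function name `guided_noise`, the `scale` parameter, and the flat-list representation of a noise prediction are all assumptions made for this example.

```python
from typing import List


def guided_noise(eps_cond: List[float], eps_uncond: List[float], scale: float) -> List[float]:
    """Blend conditional and unconditional noise predictions (hypothetical helper).

    scale = 0.0 returns the unconditional prediction,
    scale = 1.0 returns the conditional prediction,
    scale > 1.0 extrapolates past the conditional prediction.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction toward the conditional one by `scale`.
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    # Toy values, not taken from the paper.
    print(guided_noise([1.0, 2.0], [0.5, 1.0], scale=0.5))
```

In a real sampler the two predictions would be tensors produced by the same denoiser with and without the conditioning input, and the blend would be applied at every denoising step.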

Results

Task	Dataset	Metric	Value	Model
Video	UCF-101	FVD16	372	LVDM (256x256, unconditional)
Video	UCF-101	KVD16	27	LVDM (256x256, unconditional)
Video	UCF-101	FVD16	552	LVDM (256x256, unconditional)
Video	UCF-101	KVD16	42	LVDM (256x256, unconditional)
Video	UCF-101	FVD16	1209	TGAN-v2 (128x128)
Video	UCF-101	FVD16	1396	VDM
Video	UCF-101	KVD16	116	VDM
Video	UCF-101	FVD16	2460	MCVD
Video	UCF-101	KVD16	148	MCVD
Video	Taichi	FVD16	94.6	TATS (128x128)
Video	Taichi	KVD16	9.8	TATS (128x128)
Video	Taichi	FVD16	99	LVDM (256x256)
Video	Taichi	KVD16	15.3	LVDM (256x256)
Video	Taichi	FVD16	128.1	DIGAN (128x128)
Video	Taichi	KVD16	20.6	DIGAN (128x128)
Video	Taichi	FVD16	144.7	MoCoGAN-HD (128x128)
Video	Taichi	KVD16	25.4	MoCoGAN-HD (128x128)
Video	Taichi	FVD16	156.7	DIGAN (256x256)
Video	Sky Time-lapse	FVD16	95.2	LVDM (256x256)
Video	Sky Time-lapse	KVD16	3.9	LVDM (256x256)
Video	Sky Time-lapse	FVD16	107.5	Long-video GAN (128x128)
Video	Sky Time-lapse	FVD16	114.6	DIGAN (128x128)
Video	Sky Time-lapse	KVD16	6.8	DIGAN (128x128)
Video	Sky Time-lapse	FVD16	116.5	Long-video GAN (256x256)
Video	Sky Time-lapse	FVD16	132.6	TATS (128x128)
Video	Sky Time-lapse	KVD16	5.7	TATS (128x128)
Video	Sky Time-lapse	FVD16	183.6	MoCoGAN-HD (128x128)
Video	Sky Time-lapse	KVD16	13.9	MoCoGAN-HD (128x128)
