NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan

2021-11-24

Text-to-Image Generation, Text-to-Video Generation, Video Prediction, Image Generation, Video Generation

Abstract

This paper presents NÜWA, a unified multimodal pre-trained model that can generate new visual data or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to take the nature of visual data into account and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, and other tasks. Furthermore, it shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. The project repository is https://github.com/microsoft/NUWA.
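The abstract does not spell out how 3DNA works, so the following is only a minimal sketch of the general idea of local ("nearby") attention on a 3D token grid, not the authors' implementation. It assumes that each query token attends only to keys inside a small window centred on its own position, and that text, images, and videos are laid out on a common (height, width, time) grid. The function names, the grid sizes, and the window extent are all hypothetical and chosen for illustration.

```python
from itertools import product


def nearby_indices(pos, grid, extent):
    """Return the grid positions inside a local window centred on `pos`.

    pos    -- (h, w, s) coordinate of the query token
    grid   -- (H, W, S) size of the token grid
    extent -- full window width along each axis (assumed odd)
    """
    ranges = []
    for p, size, e in zip(pos, grid, extent):
        half = e // 2
        lo = max(0, p - half)          # clip the window at the grid border
        hi = min(size, p + half + 1)
        ranges.append(range(lo, hi))
    return list(product(*ranges))


def attention_pairs(grid, extent=None):
    """Count query-key pairs: full attention if extent is None, else windowed."""
    H, W, S = grid
    n = H * W * S
    if extent is None:
        return n * n
    return sum(
        len(nearby_indices(q, grid, extent))
        for q in product(range(H), range(W), range(S))
    )


# Hypothetical layouts: 1D text, 2D image, and 3D video on one 3D grid.
text_grid = (1, 1, 8)    # a sequence of 8 tokens
image_grid = (4, 4, 1)   # 4 x 4 tokens, single frame
video_grid = (4, 4, 4)   # 4 x 4 tokens over 4 frames

full = attention_pairs(video_grid)
local = attention_pairs(video_grid, extent=(3, 3, 3))
print(f"full attention pairs: {full}, nearby attention pairs: {local}")
```

Under these assumptions the number of query-key pairs grows with the window size rather than with the square of the total token count, which is one way a local attention scheme can reduce computational cost. Because axes of size 1 contribute a single position, the same function handles text and image grids without special cases.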

Results

Task	Dataset	Metric	Value	Model
Image Generation	COCO (Common Objects in Context)	FID	9.3	XMC-GAN (256 x 256)
Image Generation	COCO (Common Objects in Context)	Inception score	30.5	XMC-GAN (256 x 256)
Image Generation	COCO (Common Objects in Context)	FID	12.9	NÜWA (256 x 256)
Image Generation	COCO (Common Objects in Context)	Inception score	27.2	NÜWA (256 x 256)
Image Generation	COCO (Common Objects in Context)	FID	26	DM-GAN (256 x 256)
Image Generation	COCO (Common Objects in Context)	Inception score	32.2	DM-GAN (256 x 256)
Image Generation	COCO (Common Objects in Context)	FID	27.1	CogView (256 x 256)
Image Generation	COCO (Common Objects in Context)	Inception score	18.2	CogView (256 x 256)
Image Generation	COCO (Common Objects in Context)	FID	27.5	DALL-E (256 x 256)
Image Generation	COCO (Common Objects in Context)	Inception score	17.9	DALL-E (256 x 256)
Image Generation	COCO (Common Objects in Context)	FID	35.2	AttnGAN (256 x 256)
Image Generation	COCO (Common Objects in Context)	Inception score	23.3	AttnGAN (256 x 256)
Image Generation	COCO (Common Objects in Context)	Inception score	18.7	DF-GAN (256 x 256)
Video Generation	BAIR Robot Pushing	Cond	1	NÜWA
Video Generation	BAIR Robot Pushing	FVD score	86.9	NÜWA
Video Generation	BAIR Robot Pushing	Pred	15	NÜWA
Video Generation	BAIR Robot Pushing	Train	15	NÜWA
Text-to-Image Generation	COCO (Common Objects in Context)	FID	9.3	XMC-GAN (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	30.5	XMC-GAN (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	FID	12.9	NÜWA (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	27.2	NÜWA (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	FID	26	DM-GAN (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	32.2	DM-GAN (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	FID	27.1	CogView (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	18.2	CogView (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	FID	27.5	DALL-E (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	17.9	DALL-E (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	FID	35.2	AttnGAN (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	23.3	AttnGAN (256 x 256)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	18.7	DF-GAN (256 x 256)
Text-to-Video Generation	Kinetics	Accuracy	77.9	NÜWA (128 x 128)
Text-to-Video Generation	MSR-VTT	CLIP-FID	47.68	NÜWA
Text-to-Video Generation	MSR-VTT	CLIPSIM	0.2439	NÜWA
Text-to-Video Generation	MSR-VTT	FID	47.68	NÜWA
