Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan

Date: 2021-11-24
Tasks: Text-to-Image Generation, Text-to-Video Generation, Video Prediction, Text to Image Generation, Image Generation, Video Generation

Abstract

This paper presents a unified multimodal pre-trained model called NÜWA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is https://github.com/microsoft/NUWA.
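To make the idea behind the unified 3D layout and nearby attention more concrete, here is a minimal sketch, not the authors' implementation. It assumes that every modality is placed on a (height, width, length) grid, with text as 1 x 1 x L and an image as h x w x 1, and that nearby attention lets each position attend only to a clipped local window around itself. The function names, the grid sizes, and the window extent of (3, 3, 3) are all illustrative assumptions; the exact formulation is in the paper.

```python
from itertools import product


def nearby_positions(shape, pos, extent):
    """Grid coordinates inside a window of size `extent` centred on `pos`.

    The window is clipped at the grid borders. This is an assumed,
    simplified reading of "nearby attention", not the paper's definition.
    """
    ranges = []
    for size, p, e in zip(shape, pos, extent):
        half = e // 2
        ranges.append(range(max(0, p - half), min(size, p + half + 1)))
    return list(product(*ranges))


def attention_pairs(shape, extent=None):
    """Count (query, key) pairs: full attention if `extent` is None, else local."""
    positions = list(product(*(range(s) for s in shape)))
    if extent is None:
        return len(positions) ** 2
    return sum(len(nearby_positions(shape, p, extent)) for p in positions)


# Hypothetical token grids, all expressed as (height, width, length).
text_shape = (1, 1, 8)    # 1D data: a sequence of 8 tokens
image_shape = (4, 4, 1)   # 2D data: a 4 x 4 grid of visual tokens
video_shape = (4, 4, 4)   # 3D data: 4 frames of 4 x 4 visual tokens

full = attention_pairs(video_shape)
local = attention_pairs(video_shape, extent=(3, 3, 3))
print(f"full attention pairs:   {full}")   # 4096
print(f"nearby attention pairs: {local}")  # 1000
```

On this toy video grid the local window cuts the number of attended pairs from 4096 to 1000, and the gap widens as the grid grows, because the per-position cost is bounded by the window size rather than by the total number of tokens.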

Results

Task | Dataset | Metric | Value | Model
Image Generation | COCO (Common Objects in Context) | FID | 9.3 | XMC-GAN (256 x 256)
Image Generation | COCO (Common Objects in Context) | Inception score | 30.5 | XMC-GAN (256 x 256)
Image Generation | COCO (Common Objects in Context) | FID | 12.9 | NÜWA (256 x 256)
Image Generation | COCO (Common Objects in Context) | Inception score | 27.2 | NÜWA (256 x 256)
Image Generation | COCO (Common Objects in Context) | FID | 26 | DM-GAN (256 x 256)
Image Generation | COCO (Common Objects in Context) | Inception score | 32.2 | DM-GAN (256 x 256)
Image Generation | COCO (Common Objects in Context) | FID | 27.1 | CogView (256 x 256)
Image Generation | COCO (Common Objects in Context) | Inception score | 18.2 | CogView (256 x 256)
Image Generation | COCO (Common Objects in Context) | FID | 27.5 | DALL-E (256 x 256)
Image Generation | COCO (Common Objects in Context) | Inception score | 17.9 | DALL-E (256 x 256)
Image Generation | COCO (Common Objects in Context) | FID | 35.2 | AttnGAN (256 x 256)
Image Generation | COCO (Common Objects in Context) | Inception score | 23.3 | AttnGAN (256 x 256)
Image Generation | COCO (Common Objects in Context) | Inception score | 18.7 | DF-GAN (256 x 256)
Video | BAIR Robot Pushing | Cond | 1 | NUWA
Video | BAIR Robot Pushing | FVD score | 86.9 | NUWA
Video | BAIR Robot Pushing | Pred | 15 | NUWA
Video | BAIR Robot Pushing | Train | 15 | NUWA
Video Generation | BAIR Robot Pushing | Cond | 1 | NUWA
Video Generation | BAIR Robot Pushing | FVD score | 86.9 | NUWA
Video Generation | BAIR Robot Pushing | Pred | 15 | NUWA
Video Generation | BAIR Robot Pushing | Train | 15 | NUWA
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 9.3 | XMC-GAN (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 30.5 | XMC-GAN (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 12.9 | NÜWA (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 27.2 | NÜWA (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 26 | DM-GAN (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 32.2 | DM-GAN (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 27.1 | CogView (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 18.2 | CogView (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 27.5 | DALL-E (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 17.9 | DALL-E (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 35.2 | AttnGAN (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 23.3 | AttnGAN (256 x 256)
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 18.7 | DF-GAN (256 x 256)
Text-to-Video Generation | Kinetics | Accuracy | 77.9 | NUWA (128×128)
Text-to-Video Generation | MSR-VTT | CLIP-FID | 47.68 | NUWA
Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.2439 | NUWA
Text-to-Video Generation | MSR-VTT | FID | 47.68 | NUWA
10-shot image generation | COCO (Common Objects in Context) | FID | 9.3 | XMC-GAN (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | Inception score | 30.5 | XMC-GAN (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | FID | 12.9 | NÜWA (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | Inception score | 27.2 | NÜWA (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | FID | 26 | DM-GAN (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | Inception score | 32.2 | DM-GAN (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | FID | 27.1 | CogView (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | Inception score | 18.2 | CogView (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | FID | 27.5 | DALL-E (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | Inception score | 17.9 | DALL-E (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | FID | 35.2 | AttnGAN (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | Inception score | 23.3 | AttnGAN (256 x 256)
10-shot image generation | COCO (Common Objects in Context) | Inception score | 18.7 | DF-GAN (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 9.3 | XMC-GAN (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 30.5 | XMC-GAN (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 12.9 | NÜWA (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 27.2 | NÜWA (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 26 | DM-GAN (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 32.2 | DM-GAN (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 27.1 | CogView (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 18.2 | CogView (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 27.5 | DALL-E (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 17.9 | DALL-E (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 35.2 | AttnGAN (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 23.3 | AttnGAN (256 x 256)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 18.7 | DF-GAN (256 x 256)
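
The two COCO metrics in the results above point in opposite directions: a lower FID is better, while a higher Inception score is better. The short sketch below ranks the listed 256 x 256 models on each metric. The values are copied from the rows above; the `coco_256` list and the `rank` helper are illustrative names introduced here and are not part of the site or the paper.

```python
# (model, FID, Inception score); None marks a metric the results do not report.
coco_256 = [
    ("XMC-GAN", 9.3, 30.5),
    ("NÜWA", 12.9, 27.2),
    ("DM-GAN", 26.0, 32.2),
    ("CogView", 27.1, 18.2),
    ("DALL-E", 27.5, 17.9),
    ("AttnGAN", 35.2, 23.3),
    ("DF-GAN", None, 18.7),
]


def rank(rows, column, higher_is_better):
    """Sort models by one metric column, skipping models without a value."""
    scored = [(row[0], row[column]) for row in rows if row[column] is not None]
    return sorted(scored, key=lambda item: item[1], reverse=higher_is_better)


by_fid = rank(coco_256, column=1, higher_is_better=False)  # lower FID first
by_is = rank(coco_256, column=2, higher_is_better=True)    # higher IS first

for name, value in by_fid:
    print(f"FID {value:>5}  {name}")
for name, value in by_is:
    print(f"IS  {value:>5}  {name}")
```

DF-GAN appears only in the Inception score ranking because no FID is listed for it.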

Related Papers

LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)