Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman

2022-03-24 · Tasks: Text-to-Image Generation, Semantic Segmentation, Image Generation

Abstract

Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unaddressed, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512×512 pixels and significantly improving visual quality. Through scene controllability, we introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story we wrote.
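Point (iii) refers to classifier-free guidance. In its standard form for an autoregressive transformer, each sampling step runs the model twice, once with the text condition and once without it, and extrapolates between the two next-token logit vectors: guided = uncond + α · (cond − uncond). The sketch below shows only that combination step on plain Python lists. The function names, the toy 4-token vocabulary, and the guidance scale α = 3.0 are illustrative assumptions, and how the unconditional pass is produced (for example by dropping the text tokens) is assumed here rather than taken from this page.

```python
import math
from typing import List, Sequence


def guided_logits(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Classifier-free guidance for one step of next-token logits:
    uncond + scale * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("conditional and unconditional logits must cover the same vocabulary")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (subtracts the max logit before exponentiating)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a toy 4-token vocabulary from the two forward passes.
cond_logits = [2.0, 1.0, 0.0, -1.0]   # pass with the text prompt
uncond_logits = [1.0, 1.0, 1.0, 1.0]  # pass without the prompt (assumed)

guided = guided_logits(cond_logits, uncond_logits, scale=3.0)
print(guided)  # [4.0, 1.0, -2.0, -5.0]
probs = softmax(guided)  # distribution the next image token would be sampled from
```

With α = 1 the guided logits equal the conditional ones, and with α = 0 they equal the unconditional ones; α > 1 pushes sampling further toward tokens the text condition favors, typically trading sample diversity for text relevancy.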

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Image Generation | COCO (Common Objects in Context) | FID | 7.55 | Make-a-Scene (unfiltered)
Image Generation | COCO (Common Objects in Context) | FID | 11.84 | Make-a-Scene (unfiltered)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 7.55 | Make-a-Scene (unfiltered)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 11.84 | Make-a-Scene (unfiltered)
10-shot image generation | COCO (Common Objects in Context) | FID | 7.55 | Make-a-Scene (unfiltered)
10-shot image generation | COCO (Common Objects in Context) | FID | 11.84 | Make-a-Scene (unfiltered)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 7.55 | Make-a-Scene (unfiltered)
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 11.84 | Make-a-Scene (unfiltered)
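FID, the metric in every results row, is the Fréchet distance between two Gaussians fitted to Inception-v3 features of real and generated images: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}); lower is better. The following is a minimal NumPy sketch of that formula, assuming feature vectors have already been extracted; the Inception network and the COCO evaluation protocol behind the reported numbers are not reproduced. It computes the trace term from the eigenvalues of Σ_r Σ_g instead of a full matrix square root. The function names are hypothetical, and results may differ slightly from reference FID implementations.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1, sigma2 = np.asarray(sigma1, dtype=float), np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) is the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative for PSD covariances; clip numerical noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)


def fid_from_features(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fit a Gaussian to each (n_samples, n_features) array and compare them."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)


# Toy stand-ins for extracted features (not real Inception activations).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
shifted = real + 1.0  # same covariance, mean moved by 1 in each of 8 dimensions
print(fid_from_features(real, shifted))  # approximately 8.0
```

Because the covariances of the two toy sets are identical, only the squared mean difference (1 per dimension, 8 dimensions) survives, which makes the example easy to check by hand.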

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)