Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GLIGEN: Open-Set Grounded Text-to-Image Generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee

Published: 2023-01-17
Venue: CVPR 2023
Tasks: Text-to-Image Generation, Conditional Text-to-Image Synthesis, Layout-to-Image Generation, Text to Image Generation, Image Inpainting, Image Generation

Abstract

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
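The gated injection described in the abstract can be illustrated with a small sketch. The abstract does not spell out the form of the gate, so the code below assumes a tanh-gated residual controlled by a single learnable scalar that starts at zero. The function name `gated_residual`, the parameters `gamma` and `beta`, and the numeric values are all hypothetical and chosen only for illustration. The point is that with the gate at zero, the output of the frozen pre-trained layer passes through unchanged, so the new layers cannot disturb the pre-trained model at the start of training.

```python
import math


def gated_residual(frozen_features, grounding_update, gamma, beta=1.0):
    """Return frozen_features + beta * tanh(gamma) * grounding_update, elementwise.

    Assumption: the gate is a tanh of one learnable scalar `gamma`.
    This is an illustrative choice, not confirmed by the abstract.
    """
    gate = beta * math.tanh(gamma)
    return [f + gate * g for f, g in zip(frozen_features, grounding_update)]


# Illustrative values only: the output of a frozen pre-trained layer and
# the output of a new trainable layer that carries grounding information.
frozen = [0.5, -1.0, 2.0]
update = [0.2, 0.4, -0.6]

# With gamma at zero the gate is closed and the frozen output is unchanged.
assert gated_residual(frozen, update, gamma=0.0) == frozen

# As gamma moves away from zero, grounding information is mixed in.
print(gated_residual(frozen, update, gamma=1.0))
```

In a real model the two inputs would be tensors and `gamma` would be updated by the optimizer while the pre-trained weights stay frozen. Plain Python lists are used here only to keep the sketch self-contained.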

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | COCO (Common Objects in Context) | FID | 5.61 | GLIGEN (fine-tuned, Detection + Caption data) |
| Image Generation | COCO (Common Objects in Context) | FID | 5.82 | GLIGEN (fine-tuned, Detection data only) |
| Image Generation | COCO (Common Objects in Context) | FID | 6.38 | GLIGEN (fine-tuned, Grounding data) |
| Image Generation | COCO-MIG | Instance success rate | 0.3 | GLIGEN (zero-shot) |
| Image Generation | COCO-MIG | mIoU | 0.27 | GLIGEN (zero-shot) |
| Image Generation | LayoutBench-COCO - Size | AP | 33.3 | GLIGEN |
| Image Generation | LayoutBench-COCO - Combination | AP | 36.3 | GLIGEN |
| Image Generation | LayoutBench-COCO - Number | AP | 30.7 | GLIGEN |
| Image Generation | LayoutBench-COCO - Position | AP | 38.9 | GLIGEN |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 5.61 | GLIGEN (fine-tuned, Detection + Caption data) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 5.82 | GLIGEN (fine-tuned, Detection data only) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 6.38 | GLIGEN (fine-tuned, Grounding data) |
| Text-to-Image Generation | COCO-MIG | Instance success rate | 0.3 | GLIGEN (zero-shot) |
| Text-to-Image Generation | COCO-MIG | mIoU | 0.27 | GLIGEN (zero-shot) |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 5.61 | GLIGEN (fine-tuned, Detection + Caption data) |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 5.82 | GLIGEN (fine-tuned, Detection data only) |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 6.38 | GLIGEN (fine-tuned, Grounding data) |
| 10-shot image generation | COCO-MIG | Instance success rate | 0.3 | GLIGEN (zero-shot) |
| 10-shot image generation | COCO-MIG | mIoU | 0.27 | GLIGEN (zero-shot) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 5.61 | GLIGEN (fine-tuned, Detection + Caption data) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 5.82 | GLIGEN (fine-tuned, Detection data only) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 6.38 | GLIGEN (fine-tuned, Grounding data) |
| 1 Image, 2*2 Stitchi | COCO-MIG | Instance success rate | 0.3 | GLIGEN (zero-shot) |
| 1 Image, 2*2 Stitchi | COCO-MIG | mIoU | 0.27 | GLIGEN (zero-shot) |
