Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or

2022-08-02 | Image Generation | Text-based Image Editing
Paper | PDF | Code (official + community implementations)

Abstract

Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to verbally describing their intent. It is therefore natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models: an innate property of an editing technique is to preserve most of the original image, whereas in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the user to provide a spatial mask to localize the edit, which, however, ignores the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework in which the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. Based on this observation, we present several applications that control the image synthesis by editing the textual prompt only. These include localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
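The core idea in the abstract — recording the cross-attention maps of a source prompt and injecting them while denoising with an edited prompt — can be sketched in a few lines. This is a minimal single-head NumPy illustration, not the paper's implementation; all dimensions, names, and the random data are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, attn_override=None):
    """Single-head cross-attention.

    q: (n_pixels, d) image-feature queries; k, v: (n_tokens, d) text keys/values.
    If attn_override is given, the computed attention map is replaced with it --
    the injection step at the heart of prompt-to-prompt editing.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))     # (n_pixels, n_tokens)
    if attn_override is not None:
        attn = attn_override                 # reuse the source prompt's map
    return attn @ v, attn

# Toy shapes (hypothetical): 16 spatial positions, 4 prompt tokens, width 8.
rng = np.random.default_rng(0)
n_pix, n_tok, d = 16, 4, 8
q = rng.standard_normal((n_pix, d))
k_src = rng.standard_normal((n_tok, d))
v_src = rng.standard_normal((n_tok, d))

# Source pass: record the attention map tying each pixel to each word.
_, attn_src = cross_attention(q, k_src, v_src)

# Edited pass: values come from the edited prompt, but the *source* attention
# map is injected, so the spatial layout of the original image is preserved.
v_edit = rng.standard_normal((n_tok, d))
out_edit, attn_used = cross_attention(q, k_src, v_edit, attn_override=attn_src)
```

In the actual method this swap happens inside the cross-attention layers of a diffusion U-Net at each denoising step; word replacement, refinement, and re-weighting edits differ only in how the source map is carried over or scaled.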

Results

Task                      Dataset    Metric              Value   Model
Image Generation          PIE-Bench  Background LPIPS    208.8   DDIM Inversion + Prompt-to-Prompt
Image Generation          PIE-Bench  Background PSNR     17.87   DDIM Inversion + Prompt-to-Prompt
Image Generation          PIE-Bench  CLIPSIM             25.01   DDIM Inversion + Prompt-to-Prompt
Image Generation          PIE-Bench  Structure Distance  69.43   DDIM Inversion + Prompt-to-Prompt
Text-to-Image Generation  PIE-Bench  Background LPIPS    208.8   DDIM Inversion + Prompt-to-Prompt
Text-to-Image Generation  PIE-Bench  Background PSNR     17.87   DDIM Inversion + Prompt-to-Prompt
Text-to-Image Generation  PIE-Bench  CLIPSIM             25.01   DDIM Inversion + Prompt-to-Prompt
Text-to-Image Generation  PIE-Bench  Structure Distance  69.43   DDIM Inversion + Prompt-to-Prompt
10-shot image generation  PIE-Bench  Background LPIPS    208.8   DDIM Inversion + Prompt-to-Prompt
10-shot image generation  PIE-Bench  Background PSNR     17.87   DDIM Inversion + Prompt-to-Prompt
10-shot image generation  PIE-Bench  CLIPSIM             25.01   DDIM Inversion + Prompt-to-Prompt
10-shot image generation  PIE-Bench  Structure Distance  69.43   DDIM Inversion + Prompt-to-Prompt
1 Image, 2*2 Stitchi      PIE-Bench  Background LPIPS    208.8   DDIM Inversion + Prompt-to-Prompt
1 Image, 2*2 Stitchi      PIE-Bench  Background PSNR     17.87   DDIM Inversion + Prompt-to-Prompt
1 Image, 2*2 Stitchi      PIE-Bench  CLIPSIM             25.01   DDIM Inversion + Prompt-to-Prompt
1 Image, 2*2 Stitchi      PIE-Bench  Structure Distance  69.43   DDIM Inversion + Prompt-to-Prompt
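Of the metrics above, Background PSNR measures pixel-level preservation of the unedited background region between the source and edited images. As an illustration only (this is a standard PSNR computation, not the PIE-Bench evaluation code):

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((np.asarray(img_a, dtype=float) - np.asarray(img_b, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform 0.1 offset on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
a = np.zeros((4, 4))
b = np.full((4, 4), 0.1)
print(psnr(a, b))  # 20.0
```

Higher PSNR means the background is better preserved; LPIPS and Structure Distance are learned perceptual distances, where lower is better.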

Related Papers

NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining (2025-07-18)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)