Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AnyText: Multilingual Visual Text Generation And Editing

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie

2023-11-06 · Text Generation · Image Generation · Optical Character Recognition (OCR)
Paper · PDF · Code (official)

Abstract

Diffusion-based text-to-image models have achieved impressive results recently. Although current image-synthesis technology is highly advanced and can generate images with high fidelity, flaws still give the show away when one focuses on the text regions of a generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs such as text glyphs, position, and a masked image to generate latent features for text generation or editing. The latter employs an OCR model to encode stroke data as embeddings, which are blended with image-caption embeddings from the tokenizer to generate text that integrates seamlessly with the background. We employ a text-control diffusion loss and a text perceptual loss during training to further improve writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community to render or edit text accurately. In extensive evaluation experiments, our method outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text-image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on the AnyWord-3M dataset, we propose AnyText-benchmark for evaluating visual text generation accuracy and quality. Our project will be open-sourced at https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
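As the abstract describes, the text embedding module replaces the caption's text tokens with OCR-derived glyph embeddings before conditioning the diffusion model, and training combines a text-control diffusion loss with a text perceptual loss. A minimal NumPy sketch of those two ideas; the function names, shapes, and the loss weight are illustrative assumptions, not the paper's actual API:

```python
import numpy as np

def blend_text_embeddings(caption_emb, glyph_emb, text_positions):
    """Blend OCR-derived glyph embeddings into caption embeddings.

    caption_emb    : (seq_len, dim) tokenizer embeddings of the image caption.
    glyph_emb      : (n_spans, dim) embeddings of the rendered text, produced
                     by an OCR encoder from the glyph image (stroke data).
    text_positions : token indices where the caption's text placeholders sit.

    Returns a copy of caption_emb with glyph embeddings substituted, so the
    denoiser is conditioned on the actual strokes to render rather than on
    how the words happen to tokenize.
    """
    blended = caption_emb.copy()
    for pos, glyph in zip(text_positions, glyph_emb):
        blended[pos] = glyph
    return blended

def total_loss(diffusion_loss, text_perceptual_loss, lam=0.01):
    """Combined training objective: diffusion loss plus a weighted text
    perceptual loss (lam is a hypothetical weight, not from the paper)."""
    return diffusion_loss + lam * text_perceptual_loss
```

The substitution leaves non-text caption tokens untouched, which is what lets the rendered text blend with a background still described by the ordinary caption embeddings.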

Results

Task | Dataset | Metric | Value | Model
Image Generation | TextAtlasEval StyledTextSynth | CLIP Score | 0.2501 | AnyText
Image Generation | TextAtlasEval StyledTextSynth | FID | 117.71 | AnyText
Image Generation | TextAtlasEval StyledTextSynth | OCR (Accuracy) | 0.35 | AnyText
Image Generation | TextAtlasEval StyledTextSynth | OCR (CER) | 0.98 | AnyText
Image Generation | TextAtlasEval StyledTextSynth | OCR (F1 Score) | 0.66 | AnyText
Image Generation | TextAtlasEval TextScenesHQ | CLIP Score | 0.2174 | AnyText
Image Generation | TextAtlasEval TextScenesHQ | FID | 101.32 | AnyText
Image Generation | TextAtlasEval TextScenesHQ | OCR (Accuracy) | 0.42 | AnyText
Image Generation | TextAtlasEval TextScenesHQ | OCR (CER) | 0.95 | AnyText
Image Generation | TextAtlasEval TextScenesHQ | OCR (F1 Score) | 0.8 | AnyText
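For reading the table: OCR (CER) is the character error rate, conventionally computed as the Levenshtein edit distance between the recognized text and the reference divided by the reference length, so lower is better and values near 1 mean almost every character is wrong. A minimal self-contained sketch of that standard metric (not code from the benchmark itself):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length.

    Uses a single-row dynamic-programming table for the edit distance.
    """
    m, n = len(reference), len(hypothesis)
    row = list(range(n + 1))  # distances from "" to each hypothesis prefix
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(
                row[j] + 1,      # deletion
                row[j - 1] + 1,  # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev = cur
    return row[n] / max(m, 1)
```

On this definition a CER of 0.98 on StyledTextSynth means the recognized text differs from the ground truth in nearly every character position.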

Related Papers

Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment (2025-07-17)