
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DiffusionSTR: Diffusion Model for Scene Text Recognition

Masato Fujitake

2023-06-29 | Scene Text Recognition | Image to text

Abstract

This paper presents Diffusion Model for Scene Text Recognition (DiffusionSTR), an end-to-end text recognition framework that uses diffusion models to recognize text in the wild. Whereas existing studies treat scene text recognition as an image-to-text transformation, we recast it as a text-to-text transformation conditioned on images within a diffusion model. We show for the first time that a diffusion model can be applied to text recognition. Experimental results on publicly available datasets show that the proposed method achieves accuracy competitive with state-of-the-art methods.

Results

Task                      Dataset    Metric    Value  Model
Scene Parsing             SVT        Accuracy  93.6   DiffusionSTR
Scene Parsing             SVTP       Accuracy  89.2   DiffusionSTR
Scene Parsing             CUTE80     Accuracy  92.5   DiffusionSTR
Scene Parsing             ICDAR2015  Accuracy  86     DiffusionSTR
Scene Parsing             IIIT5k     Accuracy  97.3   DiffusionSTR
Scene Parsing             ICDAR2013  Accuracy  97.1   DiffusionSTR
2D Semantic Segmentation  SVT        Accuracy  93.6   DiffusionSTR
2D Semantic Segmentation  SVTP       Accuracy  89.2   DiffusionSTR
2D Semantic Segmentation  CUTE80     Accuracy  92.5   DiffusionSTR
2D Semantic Segmentation  ICDAR2015  Accuracy  86     DiffusionSTR
2D Semantic Segmentation  IIIT5k     Accuracy  97.3   DiffusionSTR
2D Semantic Segmentation  ICDAR2013  Accuracy  97.1   DiffusionSTR
Scene Text Recognition    SVT        Accuracy  93.6   DiffusionSTR
Scene Text Recognition    SVTP       Accuracy  89.2   DiffusionSTR
Scene Text Recognition    CUTE80     Accuracy  92.5   DiffusionSTR
Scene Text Recognition    ICDAR2015  Accuracy  86     DiffusionSTR
Scene Text Recognition    IIIT5k     Accuracy  97.3   DiffusionSTR
Scene Text Recognition    ICDAR2013  Accuracy  97.1   DiffusionSTR
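
To make the accuracy figures above easier to work with, here is a minimal Python sketch. It assumes that "Accuracy" means the percentage of word images whose predicted string exactly matches the ground truth, which is a common convention for scene text recognition but is not stated on this page. The function names `word_accuracy` and `unweighted_mean` are hypothetical helpers introduced for illustration, and the average is a plain unweighted mean over the six listed datasets, not a figure reported by the paper.

```python
from statistics import mean

# Accuracies (%) copied from the results listed above for DiffusionSTR.
REPORTED = {
    "SVT": 93.6,
    "SVTP": 89.2,
    "CUTE80": 92.5,
    "ICDAR2015": 86.0,
    "IIIT5k": 97.3,
    "ICDAR2013": 97.1,
}


def word_accuracy(predictions, references):
    """Return the percentage of predictions that exactly match their reference.

    Assumption: exact string match per word image. Real evaluation protocols
    may normalize case or strip punctuation first; this sketch does not.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def unweighted_mean(scores):
    """Average the per-dataset scores, giving each dataset equal weight."""
    return mean(scores.values())


if __name__ == "__main__":
    # Toy strings for illustration only; they are not taken from any benchmark.
    print(word_accuracy(["hello", "w0rld"], ["hello", "world"]))
    print(round(unweighted_mean(REPORTED), 2))
```

Because the per-dataset sample counts are not given here, the mean weights each benchmark equally; a sample-weighted average would generally give a different number.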
