TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/CDistNet: Perceiving Multi-Domain Character Distance for R...

CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

Tianlun Zheng, Zhineng Chen, Shancheng Fang, Hongtao Xie, Yu-Gang Jiang

2021-11-22Scene Text Recognition
PaperPDFCodeCodeCode(official)

Abstract

The Transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered and therefore, feature and character might be misaligned in difficult text (e.g., with a rare shape). As a result, constraints such as character position are introduced to alleviate this problem. Despite certain success, visual and semantic are still separately modeled and they are merely loosely associated. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position embedding. MDCDP uses the position embedding to query both visual and semantic features following the cross-attention mechanism. The two kinds of clues are fused into the position branch, generating a content-aware embedding that well perceives character spacing and orientation variants, character semantic affinities, and clues tying the two kinds of information. They are summarized as the multi-domain character distance. We develop CDistNet that stacks multiple MDCDPs to guide a gradually precise distance modeling. Thus, the feature-character alignment is well built even various recognition difficulties are presented. We verify CDistNet on ten challenging public datasets and two series of augmented datasets created by ourselves. The experiments demonstrate that CDistNet performs highly competitively. It not only ranks top-tier in standard benchmarks, but also outperforms recent popular methods by obvious margins on real and augmented datasets presenting severe text deformation, poor linguistic support, and rare character layouts. Code is available at https://github.com/simplify23/CDistNet.

Results

TaskDatasetMetricValueModel
Scene ParsingSVTAccuracy93.82CDistNet (Ours)
Scene ParsingSVTPAccuracy89.77CDistNet (Ours)
Scene ParsingCUTE80Accuracy89.58CDistNet (Ours)
Scene ParsingICDAR2015Accuracy86.25CDistNet (Ours)
Scene ParsingIIIT5kAccuracy96.57CDistNet (Ours)
Scene ParsingICDAR2013Accuracy97.67CDistNet (Ours)
2D Semantic SegmentationSVTAccuracy93.82CDistNet (Ours)
2D Semantic SegmentationSVTPAccuracy89.77CDistNet (Ours)
2D Semantic SegmentationCUTE80Accuracy89.58CDistNet (Ours)
2D Semantic SegmentationICDAR2015Accuracy86.25CDistNet (Ours)
2D Semantic SegmentationIIIT5kAccuracy96.57CDistNet (Ours)
2D Semantic SegmentationICDAR2013Accuracy97.67CDistNet (Ours)
Scene Text RecognitionSVTAccuracy93.82CDistNet (Ours)
Scene Text RecognitionSVTPAccuracy89.77CDistNet (Ours)
Scene Text RecognitionCUTE80Accuracy89.58CDistNet (Ours)
Scene Text RecognitionICDAR2015Accuracy86.25CDistNet (Ours)
Scene Text RecognitionIIIT5kAccuracy96.57CDistNet (Ours)
Scene Text RecognitionICDAR2013Accuracy97.67CDistNet (Ours)

Related Papers

Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition2025-03-24Efficient and Accurate Scene Text Recognition with Cascaded-Transformers2025-03-24Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation2025-03-20A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition2025-03-19EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition2025-02-13Billet Number Recognition Based on Test-Time Adaptation2025-02-13Ocean-OCR: Towards General OCR Application via a Vision-Language Model2025-01-26Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance2024-12-13