Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, DaCheng Tao

2022-11-19 | CVPR 2023
Tasks: Text Matching, Scene Text Detection, Text Spotting, Text Detection

Abstract

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations, thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel. Besides, we also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations, which require much less annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.
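The idea of representing a text instance as ordered points can be illustrated with a small sketch. The abstract does not say how the points are obtained, so the uniform arc-length sampling along a polyline center line used here is an assumption for illustration, and the function name `sample_ordered_points` is hypothetical rather than taken from the DeepSolo code base.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def sample_ordered_points(center_line: List[Point], num_points: int) -> List[Point]:
    """Sample num_points points, evenly spaced by arc length, along a polyline.

    Illustrative only: the points come back in reading order, which is the
    property the abstract relies on when it models characters as ordered points.
    """
    if num_points < 2 or len(center_line) < 2:
        raise ValueError("need at least two vertices and two sample points")

    # Cumulative arc length at each vertex of the polyline.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(center_line, center_line[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    samples: List[Point] = []
    seg = 0
    for i in range(num_points):
        target = total * i / (num_points - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(center_line) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = center_line[seg], center_line[seg + 1]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples


# Hypothetical L-shaped center line with five sample points.
points = sample_ordered_points([(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)], 5)
```

In a DETR-like model, each sampled location would typically seed one point query. That mapping is described only at a high level in the abstract and is not implemented here.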

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text Spotting | Total-Text | F-measure (%) - Full Lexicon | 89.6 | DeepSolo (ViTAEv2-S, TextOCR) |
| Text Spotting | Total-Text | F-measure (%) - No Lexicon | 83.6 | DeepSolo (ViTAEv2-S, TextOCR) |
| Text Spotting | Total-Text | F-measure (%) - Full Lexicon | 88.7 | DeepSolo (ResNet-50, TextOCR) |
| Text Spotting | Total-Text | F-measure (%) - No Lexicon | 82.5 | DeepSolo (ResNet-50, TextOCR) |
| Text Spotting | Total-Text | F-measure (%) - Full Lexicon | 87 | DeepSolo (ResNet-50) |
| Text Spotting | Total-Text | F-measure (%) - No Lexicon | 79.7 | DeepSolo (ResNet-50) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Generic Lexicon | 79.5 | DeepSolo (ViTAEv2-S, TextOCR) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Strong Lexicon | 88.1 | DeepSolo (ViTAEv2-S, TextOCR) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Weak Lexicon | 83.9 | DeepSolo (ViTAEv2-S, TextOCR) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Generic Lexicon | 79.1 | DeepSolo (ResNet-50, TextOCR) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Strong Lexicon | 88 | DeepSolo (ResNet-50, TextOCR) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Weak Lexicon | 83.5 | DeepSolo (ResNet-50, TextOCR) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Generic Lexicon | 76.9 | DeepSolo (ResNet-50) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Strong Lexicon | 86.8 | DeepSolo (ResNet-50) |
| Text Spotting | ICDAR 2015 | F-measure (%) - Weak Lexicon | 81.9 | DeepSolo (ResNet-50) |
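For readers unfamiliar with the metric, F-measure is the harmonic mean of precision and recall. The sketch below shows only that arithmetic. The counts are hypothetical, and the rules that decide whether a spotted word counts as a true positive (overlap threshold, lexicon handling) are benchmark-specific and not given on this page.

```python
def f_measure_percent(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return the F-measure (harmonic mean of precision and recall) in percent."""
    if true_pos == 0:
        return 0.0  # No correct predictions: precision and recall are both zero.
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from any benchmark run.
score = f_measure_percent(true_pos=80, false_pos=20, false_neg=20)
```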

Related Papers

AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models (2025-07-07)
PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning (2025-06-18)
Text-Aware Image Restoration with Diffusion Models (2025-06-11)
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models (2025-06-10)
Task-driven real-world super-resolution of document scans (2025-06-08)
CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media (2025-06-05)
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors (2025-05-30)
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking (2025-05-28)