Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting

Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao

2023-05-31 | Scene Text Detection, Text Spotting, Text Detection

Abstract

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate heuristic post-processing, they still suffer from poor synergy between the sub-tasks and from low training efficiency. They also overlook multilingual text spotting, which requires an extra script identification task. In this paper, we present DeepSolo++, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries have encoded the requisite text semantics and locations, and can thus be further decoded to the center line, boundary, script, and confidence of the text via very simple prediction heads in parallel. Furthermore, we show the surprisingly good extensibility of our method in terms of character class, language type, and task. On the one hand, our method not only performs well in English scenes but also masters transcription with complex font structures and thousands of character classes, as in Chinese. On the other hand, DeepSolo++ achieves better performance on the additionally introduced script identification task with a simpler training pipeline than previous methods. In addition, our models are compatible with line annotations, which require much less annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.
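
The "ordered points" representation in the abstract can be illustrated with a small sketch. The abstract does not say how the points are placed, so this example makes its own assumptions: the center line is given as a polyline, and points are spaced evenly by arc length. The function name `sample_ordered_points` and the sample coordinates are hypothetical and are not taken from the DeepSolo code base.

```python
import math


def sample_ordered_points(center_line, num_points):
    """Return num_points points spaced evenly by arc length along a polyline.

    center_line: list of (x, y) vertices ordered along the reading direction.
    This is an illustrative assumption, not the authors' sampling scheme.
    """
    if num_points < 2 or len(center_line) < 2:
        raise ValueError("need at least two vertices and two points")

    # Cumulative arc length at every vertex of the polyline.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(center_line, center_line[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0:
        raise ValueError("center line has zero length")

    points = []
    seg = 0  # index of the segment that contains the current target length
    for i in range(num_points):
        target = total * i / (num_points - 1)
        # Advance to the segment whose end lies at or beyond the target.
        while seg < len(cum) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = center_line[seg], center_line[seg + 1]
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


# Hypothetical L-shaped center line; the points keep the reading order.
pts = sample_ordered_points([(0, 0), (4, 0), (4, 4)], 5)
print(pts)
```

Because the points stay in reading order, the i-th point can stand for the i-th position in the character sequence. This is the property that lets one query per point carry both location and text semantics.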

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text Spotting | Inverse-Text | F-measure (%) - Full Lexicon | 75.8 | DeepSolo (ViTAEv2-S, TextOCR) |
| Text Spotting | Inverse-Text | F-measure (%) - No Lexicon | 68.8 | DeepSolo (ViTAEv2-S, TextOCR) |
| Text Spotting | Inverse-Text | F-measure (%) - Full Lexicon | 71.2 | DeepSolo (ResNet-50, TextOCR) |
| Text Spotting | Inverse-Text | F-measure (%) - No Lexicon | 64.6 | DeepSolo (ResNet-50, TextOCR) |
| Text Spotting | Inverse-Text | F-measure (%) - Full Lexicon | 53.9 | DeepSolo (ResNet-50) |
| Text Spotting | Inverse-Text | F-measure (%) - No Lexicon | 48.5 | DeepSolo (ResNet-50) |
| Text Spotting | SCUT-CTW1500 | F-measure (%) - Full Lexicon | 81.4 | DeepSolo (ResNet-50) |
| Text Spotting | SCUT-CTW1500 | F-measure (%) - No Lexicon | 64.2 | DeepSolo (ResNet-50) |
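
This page does not define the metric. Assuming the usual definition, F-measure is the harmonic mean of precision and recall, which can be sketched as follows. The precision and recall values in the usage line are made up for illustration and do not correspond to any row above.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is matched
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision and recall, reported as a percentage.
print(round(100 * f_measure(0.80, 0.60), 1))
```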
