A3S: Adversarial learning of semantic representations for Scene-Text Spotting

Masato Fujitake

2023-02-21

Text Spotting

Abstract

Scene-text spotting is the task of detecting text regions in natural scene images and recognizing their characters simultaneously. It has attracted much attention in recent years due to its wide range of applications. Existing research has mainly focused on improving text region detection rather than text recognition. Thus, while detection accuracy has improved, end-to-end accuracy remains insufficient. Text in natural scene images tends to be not a random string of characters but a meaningful one, namely a word. Therefore, we propose adversarial learning of semantic representations for scene text spotting (A3S) to improve end-to-end accuracy, including text recognition. A3S simultaneously predicts semantic features in the detected text area instead of only performing text recognition based on existing visual features. Experimental results on publicly available datasets show that the proposed method achieves better accuracy than other methods.

Results

Task	Dataset	Metric	Value	Model
Text Spotting	Total-Text	F-measure (%) - Full Lexicon	85.1	A3S
Text Spotting	Total-Text	F-measure (%) - No Lexicon	79.4	A3S
Text Spotting	SCUT-CTW1500	F-measure (%) - Full Lexicon	82.3	A3S
Text Spotting	SCUT-CTW1500	F-measure (%) - No Lexicon	64.4	A3S
Text Spotting	ICDAR 2015	F-measure (%) - Generic Lexicon	79.6	A3S
Text Spotting	ICDAR 2015	F-measure (%) - Strong Lexicon	84.8	A3S
Text Spotting	ICDAR 2015	F-measure (%) - Weak Lexicon	83.7	A3S
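
The F-measure reported above is the harmonic mean of precision and recall. As a minimal sketch of how such a value is computed, assume a prediction counts as a true positive when it is matched to a ground-truth word and its transcription is correct. The matching protocol itself is not described on this page, and the function name `f_measure` and the counts in the usage note are hypothetical, not taken from the paper.

```python
def f_measure(true_positives: int, num_predictions: int, num_ground_truth: int) -> float:
    """Return the F-measure (harmonic mean of precision and recall) in percent.

    All three arguments are plain counts; how a prediction is judged correct
    (region overlap, transcription match, lexicon use) is decided upstream
    and is an assumption outside this sketch.
    """
    # With no predictions or no ground truth, precision or recall is undefined;
    # this sketch reports 0.0 in that case.
    if num_predictions == 0 or num_ground_truth == 0:
        return 0.0
    precision = true_positives / num_predictions
    recall = true_positives / num_ground_truth
    # Avoid division by zero when nothing was predicted correctly.
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

For example, `f_measure(80, 100, 100)` gives precision and recall of 0.8 each, so the result is approximately 80.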

Related Papers

Text-Aware Image Restoration with Diffusion Models (2025-06-11)
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking (2025-05-28)
SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting (2025-04-14)
TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification (2025-03-09)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models (2025-02-22)
CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR (2025-01-01)
Hear the Scene: Audio-Enhanced Text Spotting (2024-12-27)
InstructOCR: Instruction Boosting Scene Text Spotting (2024-12-20)