Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

Yue He, Chen Chen, Jing Zhang, Juhua Liu, Fengxiang He, Chaoyue Wang, Bo Du

2021-12-24 | AAAI 2022 | Scene Text Recognition, Language Modelling

Abstract

Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model. This ignores the 2D spatial context of visual semantics within and between character instances, so these methods do not generalize well to arbitrarily shaped scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each character instance, where nodes represent its pixels and edges are added between nodes based on their spatial similarity. These subgraphs are then sequentially connected through their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR), supervised with a cross-entropy loss. GTR can be easily plugged into representative STR models to improve their performance through better textual reasoning. Specifically, we construct our model, S-GTR, by placing GTR in parallel with the language model in a segmentation-based STR baseline, which effectively exploits visual-linguistic complementarity via mutual learning. S-GTR sets a new state of the art on six challenging STR benchmarks and generalizes well to multilingual datasets. Code is available at https://github.com/adeline-cs/GTR.
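
The graph construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository above for that). Several details are assumptions made only for illustration: the input is a label map in which pixel value k marks the k-th character in reading order, "spatial similarity" is approximated by a Euclidean distance threshold (the hypothetical `radius` parameter), each root node is taken to be the pixel nearest its instance centroid, and `gcn_layer` is a generic symmetric-normalized graph convolution rather than the GTR architecture.

```python
import numpy as np


def build_text_graph(label_map, radius=1.5):
    """Build one graph from a character label map (illustrative sketch).

    label_map: 2D int array, 0 = background, k >= 1 = k-th character in
    reading order (an assumption of this sketch, not stated in the paper).
    Returns (coords, adjacency, roots).
    """
    coords, inst_ids = [], []
    for k in range(1, int(label_map.max()) + 1):
        ys, xs = np.nonzero(label_map == k)
        coords.extend(zip(ys, xs))
        inst_ids.extend([k] * len(ys))
    coords = np.array(coords, dtype=float).reshape(-1, 2)
    inst_ids = np.array(inst_ids)

    # Subgraph per instance: connect pixels of the same character that lie
    # within `radius` of each other (assumed stand-in for spatial similarity).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    same_instance = inst_ids[:, None] == inst_ids[None, :]
    adj = ((dist <= radius) & same_instance).astype(float)
    np.fill_diagonal(adj, 0.0)

    # Root node per instance: the pixel closest to the centroid (assumption).
    roots = []
    for k in np.unique(inst_ids):
        idx = np.nonzero(inst_ids == k)[0]
        centroid = coords[idx].mean(axis=0)
        nearest = np.argmin(np.linalg.norm(coords[idx] - centroid, axis=1))
        roots.append(int(idx[nearest]))

    # Sequentially connect the subgraphs through their root nodes.
    for a, b in zip(roots[:-1], roots[1:]):
        adj[a, b] = adj[b, a] = 1.0
    return coords, adj, roots


def gcn_layer(adj, feats, weight):
    """One generic graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(len(adj))
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)
```

For a toy label map such as `np.array([[1, 1, 0, 2], [1, 0, 0, 2]])`, this yields five nodes, two subgraphs, and a single root-to-root edge joining them. Node features passed to `gcn_layer` would in practice come from the VR model's feature maps; here any array with one row per node works.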

Results

Task	Dataset	Metric	Value	Model
Scene Text Recognition	SVT	Accuracy	95.8	S-GTR
Scene Text Recognition	SVTP	Accuracy	90.6	S-GTR
Scene Text Recognition	CUTE80	Accuracy	94.7	S-GTR
Scene Text Recognition	ICDAR2015	Accuracy	87.3	S-GTR
Scene Text Recognition	IIIT5k	Accuracy	97.5	S-GTR
Scene Text Recognition	ICDAR2013	Accuracy	97.8	S-GTR
