
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

Yue He, Chen Chen, Jing Zhang, Juhua Liu, Fengxiang He, Chaoyue Wang, Bo Du

2021-12-24 | AAAI 2022 | Scene Text Recognition | Language Modelling

Abstract

Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model. This ignores the 2D spatial context of visual semantics within and between character instances, so these methods generalize poorly to arbitrarily shaped scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. Then, these subgraphs are sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR), supervised with a cross-entropy loss. GTR can be easily plugged into representative STR models to improve their performance owing to better textual reasoning. Specifically, we construct our model, namely S-GTR, by placing GTR in parallel with the language model in a segmentation-based STR baseline, which can effectively exploit the visual-linguistic complementarity via mutual learning. S-GTR sets a new state of the art on six challenging STR benchmarks and generalizes well to multilingual datasets. Code is available at https://github.com/adeline-cs/GTR.
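The graph construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the official code is at the repository linked above); it rests on several assumptions that the abstract does not specify: "spatial similarity" is approximated by a Euclidean distance threshold (`radius`), the root node of each subgraph is taken to be the pixel nearest the instance centroid, and the reasoning step is a single standard symmetric-normalized graph convolution. The names `build_text_graph` and `gcn_layer` are hypothetical.

```python
import numpy as np


def build_text_graph(masks, radius=1.5):
    """Build one graph from per-character binary masks given in reading order.

    Assumptions (not from the paper): pixels within `radius` of each other
    are linked, and each subgraph's root is the pixel nearest its centroid.
    Returns the adjacency matrix, node coordinates, and root node indices.
    """
    coords_list, blocks, roots = [], [], []
    offset = 0
    for mask in masks:
        coords = np.argwhere(mask).astype(float)  # (n_i, 2) pixel positions
        if len(coords) == 0:
            continue  # skip empty character instances
        # Pairwise Euclidean distances between pixels of this instance.
        diff = coords[:, None, :] - coords[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
        blocks.append((dist <= radius) & (dist > 0))
        # Root node: the pixel closest to the instance centroid.
        centroid = coords.mean(axis=0)
        root = int(np.argmin(np.linalg.norm(coords - centroid, axis=1)))
        roots.append(offset + root)
        coords_list.append(coords)
        offset += len(coords)

    adjacency = np.zeros((offset, offset), dtype=float)
    start = 0
    for block in blocks:
        k = block.shape[0]
        adjacency[start:start + k, start:start + k] = block
        start += k
    # Sequentially connect neighbouring subgraphs through their root nodes.
    for a, b in zip(roots[:-1], roots[1:]):
        adjacency[a, b] = adjacency[b, a] = 1.0

    if coords_list:
        nodes = np.concatenate(coords_list)
    else:
        nodes = np.zeros((0, 2))
    return adjacency, nodes, roots


def gcn_layer(adjacency, features, weights):
    """One symmetric-normalized graph convolution followed by ReLU."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weights
    return np.maximum(propagated, 0.0)
```

In a full model, the node features would come from the VR model's feature maps and the layer weights would be learned with the cross-entropy supervision mentioned above; here both are left to the caller.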

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Scene Parsing | SVT | Accuracy | 95.8 | S-GTR |
| Scene Parsing | SVTP | Accuracy | 90.6 | S-GTR |
| Scene Parsing | CUTE80 | Accuracy | 94.7 | S-GTR |
| Scene Parsing | ICDAR2015 | Accuracy | 87.3 | S-GTR |
| Scene Parsing | IIIT5k | Accuracy | 97.5 | S-GTR |
| Scene Parsing | ICDAR2013 | Accuracy | 97.8 | S-GTR |
| 2D Semantic Segmentation | SVT | Accuracy | 95.8 | S-GTR |
| 2D Semantic Segmentation | SVTP | Accuracy | 90.6 | S-GTR |
| 2D Semantic Segmentation | CUTE80 | Accuracy | 94.7 | S-GTR |
| 2D Semantic Segmentation | ICDAR2015 | Accuracy | 87.3 | S-GTR |
| 2D Semantic Segmentation | IIIT5k | Accuracy | 97.5 | S-GTR |
| 2D Semantic Segmentation | ICDAR2013 | Accuracy | 97.8 | S-GTR |
| Scene Text Recognition | SVT | Accuracy | 95.8 | S-GTR |
| Scene Text Recognition | SVTP | Accuracy | 90.6 | S-GTR |
| Scene Text Recognition | CUTE80 | Accuracy | 94.7 | S-GTR |
| Scene Text Recognition | ICDAR2015 | Accuracy | 87.3 | S-GTR |
| Scene Text Recognition | IIIT5k | Accuracy | 97.5 | S-GTR |
| Scene Text Recognition | ICDAR2013 | Accuracy | 97.8 | S-GTR |
