Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision Transformer for Fast and Efficient Scene Text Recognition

Rowel Atienza

Published: 2021-05-18
Tasks: Scene Text Recognition, Data Augmentation

Abstract

Scene text recognition (STR) enables computers to read text in natural scenes such as object labels, road signs, and instructions. STR helps machines make informed decisions, such as which object to pick, which direction to go, and what the next step of action is. In the body of work on STR, the focus has always been on recognition accuracy; little emphasis has been placed on speed and computational efficiency, which are equally important, especially for energy-constrained mobile machines. In this paper we propose ViTSTR, an STR model with a simple single-stage architecture built on a compute- and parameter-efficient vision transformer (ViT). Compared with a strong baseline method such as TRBA at 84.3% accuracy, our small ViTSTR achieves a competitive accuracy of 82.6% (84.2% with data augmentation) at a 2.4x speedup, using only 43.4% of the parameters and 42.2% of the FLOPS. The tiny version of ViTSTR achieves 80.3% accuracy (82.1% with data augmentation) at 2.5x the speed, requiring only 10.9% of the parameters and 11.9% of the FLOPS. With data augmentation, our base ViTSTR outperforms TRBA at 85.2% accuracy (83.7% without augmentation) at 2.3x the speed, but requires 73.2% more parameters and 61.5% more FLOPS. In terms of trade-offs, nearly all ViTSTR configurations are at or near the frontier, maximizing accuracy, speed, and computational efficiency all at the same time.
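The abstract describes ViTSTR as a single-stage model: image patches are embedded, passed through a ViT encoder, and the first output tokens are classified directly into characters, with no separate rectification or sequence-decoding stage. The following NumPy sketch illustrates that data flow only; the dimensions, the 96-class charset, the 25-character cap, and the toy one-pass "encoder" with random weights are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of the single-stage ViTSTR data flow described in the
# abstract: patchify -> embed -> ViT encoder -> per-token character classes.
# All sizes and weights below are toy assumptions, not the paper's code.

rng = np.random.default_rng(0)

IMG_H, IMG_W, PATCH = 224, 224, 16          # assumed input size and patch size
D_MODEL, MAX_CHARS, N_CLASSES = 64, 25, 96  # toy embed dim, text length cap, charset

def patchify(img):
    """Split an (H, W) image into flattened non-overlapping PATCH x PATCH patches."""
    h, w = img.shape
    patches = img.reshape(h // PATCH, PATCH, w // PATCH, PATCH)
    return patches.transpose(0, 2, 1, 3).reshape(-1, PATCH * PATCH)

# Randomly initialised stand-ins for learned parameters.
n_tokens = 1 + (IMG_H // PATCH) * (IMG_W // PATCH)          # [GO] token + patches
W_embed = rng.normal(0, 0.02, (PATCH * PATCH, D_MODEL))     # patch embedding
pos_embed = rng.normal(0, 0.02, (n_tokens, D_MODEL))        # position embedding
W_head = rng.normal(0, 0.02, (D_MODEL, N_CLASSES))          # classification head

def vit_encoder(tokens):
    """Toy stand-in for the transformer encoder: one self-attention pass."""
    scores = tokens @ tokens.T / np.sqrt(D_MODEL)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ tokens

def vitstr_forward(img):
    tokens = patchify(img) @ W_embed                        # embed patches
    go = np.zeros((1, D_MODEL))                             # prepended start token
    tokens = np.concatenate([go, tokens]) + pos_embed       # add positions
    encoded = vit_encoder(tokens)
    logits = encoded[:MAX_CHARS] @ W_head                   # first tokens -> chars
    return logits.argmax(axis=-1)                           # greedy per-slot decode

chars = vitstr_forward(rng.random((IMG_H, IMG_W)))
print(chars.shape)  # (25,) — one class id per character slot
```

The point of the sketch is the shape of the pipeline: because the encoder output is read token-by-token into character classes, recognition is a single forward pass, which is consistent with the speed and parameter savings the abstract reports.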

Results

Task                      Dataset      Metric    Value   Model
Scene Text Recognition    SVT          Accuracy  87.7    ViTSTR
Scene Text Recognition    ICDAR 2015   Accuracy  72.6    ViTSTR
Scene Text Recognition    ICDAR 2003   Accuracy  94.3    ViTSTR
Scene Text Recognition    ICDAR 2013   Accuracy  92.4    ViTSTR

Related Papers

Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Data Augmentation in Time Series Forecasting through Inverted Framework (2025-07-15)
Iceberg: Enhancing HLS Modeling with Synthetic Data (2025-07-14)
AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation (2025-07-08)