Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DTrOCR: Decoder-only Transformer for Optical Character Recognition

Masato Fujitake

2023-08-30

Tasks: Handwritten Text Recognition, Scene Text Recognition, Language Modelling, Optical Character Recognition (OCR), Task 2

Abstract

Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Optical Character Recognition (OCR) | Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | Accuracy (%) | 89.6 | DTrOCR |
| Optical Character Recognition (OCR) | Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | Accuracy (%) | 89.6 | DTrOCR 105M |
| Optical Character Recognition (OCR) | IAM | CER | 2.38 | DTrOCR 105M |
| Scene Parsing | SVT | Accuracy | 98.9 | DTrOCR 105M |
| Scene Parsing | SVTP | Accuracy | 98.6 | DTrOCR 105M |
| Scene Parsing | CUTE80 | Accuracy | 99.1 | DTrOCR 105M |
| Scene Parsing | ICDAR2015 | Accuracy | 93.5 | DTrOCR 105M |
| Scene Parsing | IIIT5k | Accuracy | 99.6 | DTrOCR 105M |
| Scene Parsing | ICDAR2013 | Accuracy | 99.4 | DTrOCR 105M |
| 2D Semantic Segmentation | SVT | Accuracy | 98.9 | DTrOCR 105M |
| 2D Semantic Segmentation | SVTP | Accuracy | 98.6 | DTrOCR 105M |
| 2D Semantic Segmentation | CUTE80 | Accuracy | 99.1 | DTrOCR 105M |
| 2D Semantic Segmentation | ICDAR2015 | Accuracy | 93.5 | DTrOCR 105M |
| 2D Semantic Segmentation | IIIT5k | Accuracy | 99.6 | DTrOCR 105M |
| 2D Semantic Segmentation | ICDAR2013 | Accuracy | 99.4 | DTrOCR 105M |
| Handwritten Text Recognition | IAM | CER | 2.38 | DTrOCR 105M |
| Scene Text Recognition | SVT | Accuracy | 98.9 | DTrOCR 105M |
| Scene Text Recognition | SVTP | Accuracy | 98.6 | DTrOCR 105M |
| Scene Text Recognition | CUTE80 | Accuracy | 99.1 | DTrOCR 105M |
| Scene Text Recognition | ICDAR2015 | Accuracy | 93.5 | DTrOCR 105M |
| Scene Text Recognition | IIIT5k | Accuracy | 99.6 | DTrOCR 105M |
| Scene Text Recognition | ICDAR2013 | Accuracy | 99.4 | DTrOCR 105M |
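
The results above use two kinds of metric: character error rate (CER, lower is better) for IAM and word-level accuracy (higher is better) for the scene text benchmarks. The sketch below shows how these metrics are commonly computed. It is illustrative only: the function names are hypothetical, and the paper's exact evaluation protocol (case sensitivity, punctuation filtering, character set) is not stated on this page, so none of that is modelled here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


def word_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Percentage of predictions that match the reference string exactly."""
    correct = sum(r == h for r, h in zip(refs, hyps))
    return 100.0 * correct / len(refs)
```

Here CER is aggregated over the whole test set (total edits divided by total reference characters) rather than averaged per line. That is an assumption; some evaluations average per-sample rates instead, which gives slightly different numbers.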

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)