Masato Fujitake
Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image and the decoder produces the recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examine whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrate that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
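To make the decoder-only idea concrete, the following is a minimal sketch of how such a model might consume an image and emit text as one continuous sequence. Everything in it is an assumption for illustration and not the paper's implementation: the patch size, the `[SEP]` and `[EOS]` tokens, the helper names, and in particular `make_stub_model`, which stands in for a trained Transformer and simply replays a fixed answer.

```python
from typing import Callable, List, Sequence

SEP = "[SEP]"  # hypothetical separator between image tokens and text tokens
EOS = "[EOS]"  # hypothetical end-of-sequence token


def image_to_patches(image: Sequence[Sequence[int]], patch: int) -> List[tuple]:
    """Split a 2D grid of pixel values into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append(tuple(
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ))
    return patches


def greedy_decode(prefix: list, next_token: Callable[[list], str], max_len: int = 32) -> str:
    """Autoregressive decoding: the image patches act as the prompt, and each
    predicted character is appended to the same sequence before the next step."""
    sequence = list(prefix) + [SEP]
    output = []
    for _ in range(max_len):
        token = next_token(sequence)
        if token == EOS:
            break
        output.append(token)
        sequence.append(token)
    return "".join(output)


def make_stub_model(answer: str) -> Callable[[list], str]:
    """Stand-in for a trained decoder (assumption): replays `answer` one
    character at a time, based on how many tokens follow the separator."""
    def next_token(sequence: list) -> str:
        generated = len(sequence) - 1 - sequence.index(SEP)
        return answer[generated] if generated < len(answer) else EOS
    return next_token


# A 4x8 dummy "image" yields two 4x4 patches that prefix the text tokens.
image = [[0] * 8 for _ in range(4)]
patches = image_to_patches(image, patch=4)
text = greedy_decode(patches, make_stub_model("OCR"))
```

The point of the sketch is structural: there is no separate encoder output to attend to, only a single sequence that starts with image tokens and continues with text tokens.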
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Optical Character Recognition (OCR) | Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | Accuracy (%) | 89.6 | DTrOCR 105M |
| Optical Character Recognition (OCR) | IAM | CER (%) | 2.38 | DTrOCR 105M |
| Handwritten Text Recognition | IAM | CER (%) | 2.38 | DTrOCR 105M |
| Scene Text Recognition | SVT | Accuracy (%) | 98.9 | DTrOCR 105M |
| Scene Text Recognition | SVTP | Accuracy (%) | 98.6 | DTrOCR 105M |
| Scene Text Recognition | CUTE80 | Accuracy (%) | 99.1 | DTrOCR 105M |
| Scene Text Recognition | ICDAR2015 | Accuracy (%) | 93.5 | DTrOCR 105M |
| Scene Text Recognition | IIIT5k | Accuracy (%) | 99.6 | DTrOCR 105M |
| Scene Text Recognition | ICDAR2013 | Accuracy (%) | 99.4 | DTrOCR 105M |
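
The two metrics in the table can be computed roughly as follows. This is a sketch under stated assumptions, not the paper's evaluation script: CER is taken here as total Levenshtein edit distance divided by total reference length, and accuracy as exact string match per word image. The actual benchmarks may apply extra normalization (for example case folding or character filtering) that is not reproduced here, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate in percent, aggregated over the whole set."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


def word_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Percentage of predictions that match their reference exactly."""
    correct = sum(r == h for r, h in zip(refs, hyps))
    return 100.0 * correct / len(refs)
```

Lower is better for CER, higher is better for accuracy, which is why the IAM rows and the scene text rows are not directly comparable.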