Levenshtein OCR

Cheng Da, Peng Wang, Cong Yao

2022-09-08Scene Text Recognition Imitation Learning Optical Character Recognition (OCR)

Abstract

A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the proposed method (named Levenshtein OCR, and LevOCR for short) explores an alternative way for automatically transcribing textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations: deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. The quantitative experiments clearly demonstrate that LevOCR achieves state-of-the-art performances on standard benchmarks and the qualitative analyses verify the effectiveness and advantage of the proposed LevOCR algorithm. Code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR.

Related Papers

The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner2025-07-17 Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)2025-07-17 VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17 DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment2025-07-17 Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis2025-07-15 A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends2025-07-14 Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices2025-07-09 Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning2025-07-09