Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi

Published: 2025-02-10 · Tasks: Benchmarking, Optical Character Recognition (OCR)
Paper · PDF · Code (official)

Abstract

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
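The WER and CER metrics named in the abstract are both ratios of edit distance to reference length, computed over words and characters respectively. As a minimal illustration (not the paper's official evaluation code, whose exact normalization may differ), they can be sketched in plain Python with a standard Levenshtein dynamic program:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists),
    using a classic two-row dynamic-programming table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:0], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion from ref
                        dp[j - 1] + 1,    # insertion into ref
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: char-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("hello world", "hello word"))  # 0.5 (1 word error out of 2)
```

Lower is better for both metrics; a value above 1.0 is possible when the hypothesis contains many insertions relative to the reference.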

Results

Task: Optical Character Recognition (OCR)
Dataset: VideoDB's OCR Benchmark Public Collection

Model             Average Accuracy   CER      WER
GPT-4o            76.22              0.2378   0.5117
Gemini-1.5 Pro    76.13              0.2387   0.2385
Claude-3 Sonnet   67.71              0.3229   0.4663
RapidOCR          56.98              0.762    0.4302
EasyOCR           49.3               0.507    0.8262

Related Papers

Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
Training Transformers with Enforced Lipschitz Constants (2025-07-17)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)