
Dataset Introduction

This dataset draws on VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It spans multiple categories, from finance and legal documents to software UI elements and handwritten notes, giving a broad representation of how text appears in real-world video. Each video is annotated with frame indexes to support consistent and reproducible OCR benchmarks. The dataset currently includes over 25 curated videos, yielding thousands of extracted frames that present a variety of text-related challenges.

Key Features

Diverse Text Genres
- Finance/Business: Includes news tickers and stock market visuals where text scrolls rapidly.
- Legal/Educational: Features documents with formal language, diagrams, and formatted text.
- Software/Web Development/UI: Shows on-screen code editors, browser windows, and other UI elements that test OCR's ability to handle varying font sizes and code snippets.
- Handwriting: Encompasses both cursive and print handwriting on whiteboards or paper, capturing the challenges of style variability and penmanship.
- Miscellaneous/Other: Covers signage, billboards, and everyday text in the wild.
Rich Annotation
- Each video includes frame indexes or scene timestamps to ensure consistent, reproducible extraction of text segments.
- Ground truth (OCR text) is provided for thousands of extracted frames, enabling quantitative evaluation with metrics such as character error rate (CER) and word error rate (WER).
Benchmark-Ready
- The dataset is designed to work with the ocr-benchmark repository, which streamlines model evaluation.
- Scripts are included for frame extraction, automatic OCR comparison, and metric calculation (CER, WER, accuracy).
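
The metrics named above can be illustrated with a minimal sketch. It assumes the standard definitions (Levenshtein edit distance divided by reference length, over characters for CER and over whitespace-separated words for WER). The function names are illustrative and are not taken from the ocr-benchmark repository, whose scripts may normalize text differently before scoring.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        # Assumed convention: empty reference scores 0 only if hypothesis is empty too.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so they are error rates rather than bounded percentages.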

How to Access

VideoDB Public Collection ID: c-c0a2c223-e377-4625-94bf-910501c2a31c
Simply reference this ID within VideoDB to retrieve and review the videos.
Ground Truth Files: Located in the ocr_ground_truths directory. The JSON files map each frame to its corresponding textual annotations.
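
As a rough sketch of how the ground truth files might be read, the snippet below loads every JSON file in a directory into a per-file mapping from frame index to text. The exact JSON schema is not described here, so the flat `{"<frame index>": "<text>"}` layout is an assumption, and `load_ground_truths` is a hypothetical helper rather than part of the dataset's tooling. Adjust the parsing to match the actual files.

```python
import json
from pathlib import Path
from typing import Dict


def load_ground_truths(directory: str) -> Dict[str, Dict[int, str]]:
    """Return {file stem: {frame index: text}} for each JSON file in `directory`."""
    result: Dict[str, Dict[int, str]] = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            raw = json.load(fh)
        # ASSUMPTION: each file is a flat object keyed by frame index (as a string)
        # with the annotated text as the value. The real schema may differ.
        result[path.stem] = {int(frame): text for frame, text in raw.items()}
    return result
```

With the files in place, `load_ground_truths("ocr_ground_truths")` would return one entry per JSON file, keyed by the file name without its extension.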

For detailed instructions on working with VideoDB Public Collections, please refer to the official documentation.

Licensing and Usage

Usage Restrictions: The videos are publicly accessible for research and educational use.
Attribution: Please cite the paper "Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments" or its repository if you use this dataset in your work.

VideoDB's OCR Benchmark Public Collection
