DocBank: A Benchmark Dataset for Document Layout Analysis

Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, Ming Zhou

2020-06-01 COLING 2020 8 Document Layout Analysis

Abstract

Document layout analysis usually relies on computer vision models to understand documents, ignoring textual information that is vital to capture. Meanwhile, high-quality labeled datasets with both visual and textual information are still insufficient. In this paper, we present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis. DocBank is constructed in a simple yet effective way, with weak supervision from the LaTeX documents available on arXiv.org. With DocBank, models from different modalities can be compared fairly, and multi-modal approaches can be further investigated to boost the performance of document layout analysis. We build several strong baselines and manually split train/dev/test sets for evaluation. Experimental results show that models trained on DocBank accurately recognize the layout information for a variety of documents. The DocBank dataset is publicly available at https://github.com/doc-analysis/DocBank.

Related Papers

Class-Agnostic Region-of-Interest Matching in Document Images (2025-06-26)
From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents (2025-06-25)
SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation (2025-05-20)
A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court (2025-05-13)
Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs (2025-05-12)
AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization (2025-03-28)
SFDLA: Source-Free Document Layout Analysis (2025-03-24)
PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction (2025-03-21)