DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

Nikitha SR, Tarun Ram Menta, Mausoom Sarkar

2024-12-17

Document Layout Analysis, Document AI, Document Image Classification, Optical Character Recognition (OCR)

Abstract

The advent of multimodal learning has brought significant improvements to document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, work in this space often focuses on the textual aspect and uses the visual space only as auxiliary information. While some works have explored purely vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. We therefore present a novel image-text alignment technique specifically designed to leverage the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
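The abstract does not spell out the alignment objective, so the following is only an illustrative sketch of what aligning image patches with OCR words could look like, not the authors' method. It assumes a square patch grid, OCR word boxes in pixel coordinates, and precomputed patch and word embeddings of the same dimension. The function names `patch_word_overlap` and `alignment_loss`, and the overlap-weighted cosine loss itself, are hypothetical choices made for this example.

```python
import numpy as np

def patch_word_overlap(word_boxes, image_size, patch_size):
    """Fraction of each patch's area covered by each word box.

    word_boxes: (W, 4) array-like of (x0, y0, x1, y1) in pixels.
    image_size: (height, width) in pixels.
    Returns an array of shape (W, P), patches in row-major order.
    """
    height, width = image_size
    rows, cols = height // patch_size, width // patch_size
    # Top-left corner of every patch, in row-major order.
    ys, xs = np.meshgrid(np.arange(rows) * patch_size,
                         np.arange(cols) * patch_size, indexing="ij")
    px0, py0 = xs.ravel(), ys.ravel()
    px1, py1 = px0 + patch_size, py0 + patch_size
    boxes = np.asarray(word_boxes, dtype=float)
    # Intersection width and height between every word box and every patch.
    inter_w = np.clip(np.minimum(boxes[:, None, 2], px1)
                      - np.maximum(boxes[:, None, 0], px0), 0, None)
    inter_h = np.clip(np.minimum(boxes[:, None, 3], py1)
                      - np.maximum(boxes[:, None, 1], py0), 0, None)
    return inter_w * inter_h / float(patch_size ** 2)

def alignment_loss(patch_emb, word_emb, overlap):
    """Overlap-weighted mean of (1 - cosine similarity) over word/patch pairs.

    This is an assumed, simplified objective used only for illustration.
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    cosine = w @ p.T  # shape (W, P)
    total = overlap.sum()
    if total == 0:
        return 0.0  # no word touches any patch, nothing to align
    return float((overlap * (1.0 - cosine)).sum() / total)
```

In this sketch the OCR boxes are needed only to build the training signal. At inference time only the image encoder that produces `patch_emb` would run, which is consistent with the abstract's claim that no OCR is required during inference.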

Results

Task	Dataset	Metric	Value	Model
Document Layout Analysis	D4LA	mAP	70.72	DoPTA
Document Layout Analysis	PubLayNet val	Figure	0.97	DoPTA
Document Layout Analysis	PubLayNet val	List	0.957	DoPTA
Document Layout Analysis	PubLayNet val	Overall	0.949	DoPTA
Document Layout Analysis	PubLayNet val	Table	0.977	DoPTA
Document Layout Analysis	PubLayNet val	Text	0.944	DoPTA
Document Layout Analysis	PubLayNet val	Title	0.895	DoPTA
