VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach
Mohamed Kerroumi, Othmane Sayem, Aymen Shabou
2020-10-05 · Document Layout Analysis
Abstract
We introduce a novel approach to scanned-document representation for field extraction. It encodes the textual, visual, and layout information simultaneously in a 3-axis tensor used as input to a segmentation model. We improve on the recent Chargrid and Wordgrid [chargrid] models in several ways: first by taking the visual modality into account, then by boosting robustness on small datasets while keeping inference time low. Our approach is tested on public and private document-image datasets, showing higher performance than recent state-of-the-art methods.
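The abstract describes stacking textual, visual, and layout signals into a single 3-axis tensor. A minimal sketch of that idea, assuming each OCR word comes with a bounding box and a fixed-size embedding (the function name `build_wordgrid` and the data layout here are illustrative, not the paper's actual API):

```python
import numpy as np

def build_wordgrid(image, words, embed_dim=3):
    """Fill each word's bounding-box region with the word's embedding,
    then concatenate with the raw image channels so the resulting grid
    carries textual, layout, and visual information at once.

    `words` is a list of (embedding, (x0, y0, x1, y1)) pairs.
    All names here are illustrative assumptions, not the paper's API.
    """
    h, w, _ = image.shape
    text_grid = np.zeros((h, w, embed_dim), dtype=np.float32)
    for emb, (x0, y0, x1, y1) in words:
        # Broadcast the word embedding over its box: layout is encoded
        # implicitly by *where* the embedding lands in the grid.
        text_grid[y0:y1, x0:x1] = emb
    # Stack visual (image) and textual channels along the last axis.
    return np.concatenate([image.astype(np.float32), text_grid], axis=-1)

# Toy example: a 4x6 white RGB image and one word box.
img = np.full((4, 6, 3), 255, dtype=np.uint8)
words = [(np.array([0.1, 0.2, 0.3], dtype=np.float32), (1, 0, 4, 2))]
grid = build_wordgrid(img, words)
print(grid.shape)  # (4, 6, 6): 3 image channels + 3 embedding channels
```

A segmentation network can then consume `grid` like an ordinary multi-channel image and predict a field-class label per pixel.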