Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision Grid Transformer for Document Layout Analysis

Cheng Da, Chuwei Luo, Qi Zheng, Cong Yao

2023-08-29 · ICCV 2023
Tasks: Document Layout Analysis · Document Understanding · Document AI · Optical Character Recognition (OCR)

Paper · PDF · Code (official)

Abstract

Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modal but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representations for DLA, in this paper we present VGT, a two-stream Vision Grid Transformer, in which the Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D⁴LA, which is so far the most diverse and detailed manually annotated benchmark for document layout analysis, is curated and released. Experimental results show that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet (95.7% → 96.2%), DocBank (79.6% → 84.1%), and D⁴LA (67.7% → 68.8%). The code and models, as well as the D⁴LA dataset, will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
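The two-stream idea in the abstract can be illustrated with a minimal sketch: OCR tokens are rasterized onto a 2D grid by their positions (the "grid" input for GiT), and the grid features are fused with visual features from the image stream. This is a simplified illustration only; `make_text_grid` and `fuse_streams` are hypothetical names, and the paper's actual grid construction and fusion mechanism may differ.

```python
import numpy as np

def make_text_grid(tokens, positions, embeddings, grid_h, grid_w, dim):
    """Rasterize token embeddings onto a 2D grid by their (x, y) cell
    positions. A simplified stand-in for the grid input that a Grid
    Transformer would consume; not the paper's exact construction."""
    grid = np.zeros((grid_h, grid_w, dim))
    for tok, (x, y) in zip(tokens, positions):
        grid[y, x] += embeddings[tok]  # overlapping tokens accumulate
    return grid

def fuse_streams(vision_feat, grid_feat):
    """Fuse the vision stream and the grid (text) stream by element-wise
    sum -- one common multi-modal fusion choice, used here purely for
    illustration."""
    return vision_feat + grid_feat
```

A usage example: place one token embedding at grid cell (x=1, y=2) and fuse it with a dummy vision feature map of the same shape; the fused map then carries both visual and textual signal at that cell, which is the kind of joint representation a downstream layout detector would operate on.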

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Document Layout Analysis | D⁴LA | mAP | 68.8 | VGT |
| Document Layout Analysis | PubLayNet val | Figure | 0.971 | VGT |
| Document Layout Analysis | PubLayNet val | List | 0.968 | VGT |
| Document Layout Analysis | PubLayNet val | Overall | 0.962 | VGT |
| Document Layout Analysis | PubLayNet val | Table | 0.981 | VGT |
| Document Layout Analysis | PubLayNet val | Text | 0.950 | VGT |
| Document Layout Analysis | PubLayNet val | Title | 0.939 | VGT |
| Document Layout Analysis | PubLayNet val | Figure | 0.968 | ResNeXt-101 32×8d |
| Document Layout Analysis | PubLayNet val | List | 0.940 | ResNeXt-101 32×8d |
| Document Layout Analysis | PubLayNet val | Overall | 0.935 | ResNeXt-101 32×8d |
| Document Layout Analysis | PubLayNet val | Table | 0.976 | ResNeXt-101 32×8d |
| Document Layout Analysis | PubLayNet val | Text | 0.930 | ResNeXt-101 32×8d |
| Document Layout Analysis | PubLayNet val | Title | 0.862 | ResNeXt-101 32×8d |

Related Papers

- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment (2025-07-17)
- Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)
- A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends (2025-07-14)
- Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices (2025-07-09)
- Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
- PaddleOCR 3.0 Technical Report (2025-07-08)
- TextPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision (2025-07-08)