Vision Grid Transformer for Document Layout Analysis

Cheng Da, Chuwei Luo, Qi Zheng, Cong Yao

2023-08-29 | ICCV 2023 | Document Layout Analysis, Document Understanding, Document AI, Optical Character Recognition (OCR)

Abstract

Document pre-trained models and grid-based models have proven very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modal but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representations for DLA, we present VGT, a two-stream Vision Grid Transformer, in which a Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D$^4$LA, so far the most diverse and detailed manually annotated benchmark for document layout analysis, is curated and released. Experimental results show that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet ($95.7\% \rightarrow 96.2\%$), DocBank ($79.6\% \rightarrow 84.1\%$), and D$^4$LA ($67.7\% \rightarrow 68.8\%$). The code and models as well as the D$^4$LA dataset will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.

Results

Task	Dataset	Metric	Value	Model
Document Layout Analysis	D4LA	mAP	68.8	VGT
Document Layout Analysis	PubLayNet val	Figure	0.971	VGT
Document Layout Analysis	PubLayNet val	List	0.968	VGT
Document Layout Analysis	PubLayNet val	Overall	0.962	VGT
Document Layout Analysis	PubLayNet val	Table	0.981	VGT
Document Layout Analysis	PubLayNet val	Text	0.95	VGT
Document Layout Analysis	PubLayNet val	Title	0.939	VGT
Document Layout Analysis	PubLayNet val	Figure	0.968	ResNeXt-101-32×8d
Document Layout Analysis	PubLayNet val	List	0.94	ResNeXt-101-32×8d
Document Layout Analysis	PubLayNet val	Overall	0.935	ResNeXt-101-32×8d
Document Layout Analysis	PubLayNet val	Table	0.976	ResNeXt-101-32×8d
Document Layout Analysis	PubLayNet val	Text	0.93	ResNeXt-101-32×8d
Document Layout Analysis	PubLayNet val	Title	0.862	ResNeXt-101-32×8d
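
To make the PubLayNet val rows easier to compare, the following minimal sketch computes the per-category difference between the two listed models in percentage points. It assumes both sets of rows report the same metric on the same split, which the table implies but does not state. The names `vgt`, `resnext`, and `gains_in_points` are hypothetical and used only for illustration; the numbers are copied from the table above.

```python
# Per-category scores on PubLayNet val, copied from the Results table.
vgt = {"Figure": 0.971, "List": 0.968, "Overall": 0.962,
       "Table": 0.981, "Text": 0.95, "Title": 0.939}
resnext = {"Figure": 0.968, "List": 0.94, "Overall": 0.935,
           "Table": 0.976, "Text": 0.93, "Title": 0.862}


def gains_in_points(model, baseline):
    """Return model minus baseline for each category, in percentage points."""
    return {k: round((model[k] - baseline[k]) * 100, 1) for k in model}


gains = gains_in_points(vgt, resnext)

# Print categories from largest to smallest difference.
for category, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<8} {delta:+.1f}")
```

Under that assumption, the largest difference is on Title (7.7 points) and the smallest on Figure (0.3 points), with Overall differing by 2.7 points.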
