DiT: Self-supervised Pre-training for Document Image Transformer

Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei

2022-03-04

Document Layout Analysis, Image Classification, Table Detection, Document AI, Document Image Classification, Text Detection, Optical Character Recognition (OCR)

Abstract

Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks. This is essential because no supervised counterparts exist, owing to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks: document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55), and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.

Results

Task	Dataset	Metric	Value	Model
Document Layout Analysis	PubLayNet val	Figure	0.972	DiT-L
Document Layout Analysis	PubLayNet val	List	0.96	DiT-L
Document Layout Analysis	PubLayNet val	Overall	0.949	DiT-L
Document Layout Analysis	PubLayNet val	Table	0.978	DiT-L
Document Layout Analysis	PubLayNet val	Text	0.944	DiT-L
Document Layout Analysis	PubLayNet val	Title	0.893	DiT-L
Table Detection	ICDAR 2019	Weighted Average F1-score	96.55	DiT-L (Cascade)
Table Detection	ICDAR 2019	Weighted Average F1-score	96.14	DiT-B (Cascade)
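
The "Weighted Average F1-score" metric in the table detection rows is not defined on this page. A minimal sketch follows, assuming the ICDAR 2019 cTDaR convention in which F1 is measured at IoU thresholds 0.6, 0.7, 0.8 and 0.9 and each score is weighted by its threshold. The function name and the per-threshold scores are hypothetical and are not taken from the paper.

```python
def weighted_average_f1(f1_by_iou):
    """Combine per-threshold F1 scores, weighting each by its IoU threshold.

    f1_by_iou maps an IoU threshold (e.g. 0.6) to the F1 score measured
    at that threshold. Assumed formula: sum(iou * f1) / sum(iou).
    """
    total_weight = sum(f1_by_iou.keys())
    if total_weight == 0:
        raise ValueError("at least one non-zero IoU threshold is required")
    weighted_sum = sum(iou * f1 for iou, f1 in f1_by_iou.items())
    return weighted_sum / total_weight


# Hypothetical per-threshold F1 scores, for illustration only.
example_scores = {0.6: 0.98, 0.7: 0.97, 0.8: 0.96, 0.9: 0.93}
wavg = weighted_average_f1(example_scores)
print(f"Weighted average F1: {wavg * 100:.2f}")
```

Because stricter thresholds carry larger weights, a detector whose boxes are only loosely aligned loses more under this metric than it would under a plain mean of the four F1 scores.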
