LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei

2022-04-18

Document Layout Analysis, Question Answering, Relation Extraction, Image Classification, Representation Learning, Semantic entity labeling, Masked Language Modeling, Entity Linking, cross-modal alignment, Document AI, Document Image Classification, Visual Question Answering (VQA), Named Entity Recognition (NER), Key Information Extraction, Key-value Pair Extraction, Language Modelling, Visual Question Answering


Abstract

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
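
The word-patch alignment objective in the abstract can be illustrated with a small sketch of how target labels might be built. This is not the authors' implementation: the function names, the 224-pixel image, the 16-pixel patch grid, the half-open integer pixel boxes, and the convention that label 1 means "an overlapping patch is masked" are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in integer pixels, half-open.
Box = Tuple[int, int, int, int]


def patches_for_box(box: Box, image_size: int, patch_size: int) -> List[int]:
    """Return row-major indices of the grid patches a word box overlaps."""
    x0, y0, x1, y1 = box
    per_row = image_size // patch_size
    # Clamp the covered column/row range to the patch grid.
    c0 = max(0, x0 // patch_size)
    r0 = max(0, y0 // patch_size)
    c1 = min(per_row - 1, (x1 - 1) // patch_size)
    r1 = min(per_row - 1, (y1 - 1) // patch_size)
    return [r * per_row + c
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)]


def word_patch_labels(word_boxes: List[Box],
                      masked_patches: Set[int],
                      image_size: int = 224,   # assumed input resolution
                      patch_size: int = 16     # assumed patch size
                      ) -> List[int]:
    """Label each word 1 if any patch it overlaps is masked, else 0."""
    labels = []
    for box in word_boxes:
        patches = patches_for_box(box, image_size, patch_size)
        labels.append(1 if any(p in masked_patches for p in patches) else 0)
    return labels


if __name__ == "__main__":
    boxes = [(0, 0, 16, 16), (10, 0, 40, 16), (16, 16, 32, 32)]
    print(word_patch_labels(boxes, masked_patches={2, 15}))
```

In a training setup along these lines, a binary classifier on each word's contextual representation would be trained against such labels. Which words contribute to the loss, and how the loss is weighted, are details this sketch leaves out.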

Results

Task	Dataset	Metric	Value	Model
Entity Linking	EC-FUNSD	F1	78.14	LayoutLMv3 (large)
Entity Linking	EC-FUNSD	F1	67.47	LayoutLMv3 (base)
Named Entity Recognition (NER)	FUNSD-r	F1	78.77	LayoutLMv3
Named Entity Recognition (NER)	CORD-r	F1	82.72	LayoutLMv3
Document Layout Analysis	PubLayNet val	Figure	0.97	LayoutLMv3-B
Document Layout Analysis	PubLayNet val	List	0.955	LayoutLMv3-B
Document Layout Analysis	PubLayNet val	Overall	0.951	LayoutLMv3-B
Document Layout Analysis	PubLayNet val	Table	0.979	LayoutLMv3-B
Document Layout Analysis	PubLayNet val	Text	0.945	LayoutLMv3-B
Document Layout Analysis	PubLayNet val	Title	0.906	LayoutLMv3-B
Document AI	EPHOIE	Average F1	99.21	LayoutLMv3
Semantic entity labeling	EC-FUNSD	F1	83.88	LayoutLMv3 (large)
Semantic entity labeling	EC-FUNSD	F1	82.3	LayoutLMv3 (base)
Semantic entity labeling	FUNSD	F1	92.08	LayoutLMv3 Large
Key Information Extraction	CORD	F1	97.46	LayoutLMv3 Large
Key Information Extraction	EPHOIE	Average F1	99.21	LayoutLMv3
Key Information Extraction	RFUND-EN	key-value pair F1	57.66	LayoutLMv3
Key Information Extraction	SIBR	key-value pair F1	73.51	LayoutLMv3_base_chinese
