LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

Date: 2020-12-29
Venue: ACL 2021
Tasks: Document Layout Analysis, Relation Extraction, Document Understanding, Semantic Entity Labeling, Document Image Classification, Visual Question Answering (VQA), Key Information Extraction, Key-value Pair Extraction, Language Modelling


Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture the cross-modality interaction in the pre-training stage. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can capture the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We made our model and code publicly available at https://aka.ms/layoutlmv2.
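The spatial-aware self-attention idea mentioned in the abstract can be illustrated with a small sketch. It assumes the mechanism adds learned bias terms, indexed by the relative 1D token offset and the relative x/y offsets of the text boxes, to the raw attention scores before the softmax. The function name, the dictionary-based bias tables, and the simple clipping of offsets are illustrative assumptions, not the released implementation.

```python
import math


def clip(value, limit):
    """Clamp a relative offset to [-limit, limit] (a simplifying assumption)."""
    return max(-limit, min(limit, value))


def spatial_aware_attention(scores, boxes, bias_1d, bias_x, bias_y, max_rel=2):
    """Add relative-position biases to attention scores, then apply softmax.

    scores: n x n list of raw attention logits.
    boxes: one (x, y) integer coordinate per token (hypothetical, pre-bucketed).
    bias_1d, bias_x, bias_y: dicts mapping a clipped offset to a bias value.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            bias = (
                bias_1d[clip(j - i, max_rel)]
                + bias_x[clip(boxes[j][0] - boxes[i][0], max_rel)]
                + bias_y[clip(boxes[j][1] - boxes[i][1], max_rel)]
            )
            row.append(scores[i][j] + bias)
        # Numerically stable softmax over the biased scores of row i.
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With all biases set to zero this reduces to ordinary softmax attention; a bias table that rewards an x offset of zero makes a token attend more strongly to tokens in the same column.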

Results

Task	Dataset	Metric	Value	Model
Relation Extraction	FUNSD	F1	70.57	LayoutLMv2 LARGE
Visual Question Answering (VQA)	DocVQA test	ANLS	0.8672	LayoutLMv2 LARGE
Visual Question Answering (VQA)	DocVQA test	ANLS	0.7808	LayoutLMv2 BASE
Semantic entity labeling	FUNSD	F1	84.2	LayoutLMv2 LARGE
Semantic entity labeling	FUNSD	F1	82.76	LayoutLMv2 BASE
Key Information Extraction	CORD	F1	96.01	LayoutLMv2 LARGE
Key Information Extraction	CORD	F1	94.95	LayoutLMv2 BASE
Key Information Extraction	Kleister-NDA	F1	85.2	LayoutLMv2 LARGE
Key Information Extraction	Kleister-NDA	F1	83.3	LayoutLMv2 BASE
Key Information Extraction	SROIE	F1	97.81	LayoutLMv2 LARGE (excluding OCR mismatch)
Key Information Extraction	SROIE	F1	96.61	LayoutLMv2 LARGE
Key Information Extraction	SROIE	F1	96.25	LayoutLMv2 BASE
Key Information Extraction	RFUND-EN	key-value pair F1	49.06	LayoutLMv2 BASE
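
The DocVQA rows report ANLS (Average Normalized Levenshtein Similarity) on a 0 to 1 scale, while the F1 rows are percentages. As a minimal sketch of how ANLS is commonly computed, assuming the usual threshold of 0.5 and case-insensitive comparison (the function names and normalization here are illustrative, not the official evaluation script):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity.

    predictions: one predicted answer string per question.
    references: one list of accepted answer strings per question.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers at or beyond the threshold distance score zero.
            if nl < threshold:
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions)
```

A prediction that differs from the reference by one character out of four scores 0.75, while a prediction with no characters in common scores 0.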
