Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). We have made our model and code publicly available at \url{https://aka.ms/layoutlmv2}.
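
The spatial-aware self-attention mentioned in the abstract can be illustrated with a minimal single-head NumPy sketch. It assumes the mechanism adds learned bias terms, indexed by the relative 1D token offset and the relative 2D box offsets, to the usual scaled dot-product scores. The function name, the bias tables, and the simple clipping of offsets to `max_dist` are all hypothetical choices made for illustration and are not taken from the released code.

```python
import numpy as np

def spatial_aware_attention(x, boxes, w_q, w_k, w_v,
                            bias_1d, bias_x, bias_y, max_dist):
    """Single-head self-attention with additive relative-position biases.

    x:       (n, d) token embeddings
    boxes:   (n, 2) integer (x, y) coordinates, one per token (assumed layout input)
    bias_*:  (2 * max_dist + 1,) bias tables indexed by the clipped relative offset
    Returns the attended values and the attention weights.
    """
    n = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])  # standard scaled dot-product scores

    def lookup(table, pos):
        rel = pos[None, :] - pos[:, None]               # rel[i, j] = pos[j] - pos[i]
        rel = np.clip(rel, -max_dist, max_dist) + max_dist  # shift into table range
        return table[rel]                               # (n, n) bias matrix

    # Add biases for sequence order, horizontal offset, and vertical offset.
    scores = (scores
              + lookup(bias_1d, np.arange(n))
              + lookup(bias_x, boxes[:, 0])
              + lookup(bias_y, boxes[:, 1]))

    scores -= scores.max(axis=1, keepdims=True)         # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

With all three bias tables set to zero, this reduces to plain self-attention; non-zero entries let the attention weights depend on how far apart two text blocks sit on the page.
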
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | FUNSD | F1 | 70.57 | LayoutLMv2-LARGE |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.8672 | LayoutLMv2-LARGE |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.7808 | LayoutLMv2-BASE |
| Semantic Entity Labeling | FUNSD | F1 | 84.2 | LayoutLMv2-LARGE |
| Semantic Entity Labeling | FUNSD | F1 | 82.76 | LayoutLMv2-BASE |
| Key Information Extraction | CORD | F1 | 96.01 | LayoutLMv2-LARGE |
| Key Information Extraction | CORD | F1 | 94.95 | LayoutLMv2-BASE |
| Key Information Extraction | Kleister-NDA | F1 | 85.2 | LayoutLMv2-LARGE |
| Key Information Extraction | Kleister-NDA | F1 | 83.3 | LayoutLMv2-BASE |
| Key Information Extraction | SROIE | F1 | 97.81 | LayoutLMv2-LARGE (excluding OCR mismatch) |
| Key Information Extraction | SROIE | F1 | 96.61 | LayoutLMv2-LARGE |
| Key Information Extraction | SROIE | F1 | 96.25 | LayoutLMv2-BASE |
| Key Information Extraction | RFUND-EN | Key-value pair F1 | 49.06 | LayoutLMv2-BASE |
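
The DocVQA rows report ANLS (Average Normalized Levenshtein Similarity) on a 0 to 1 scale, while the F1 rows are percentages. As a rough guide to how an ANLS value is obtained, the sketch below assumes the commonly used definition: per question, take the best similarity over all reference answers, zero it out when the normalized edit distance reaches a threshold of 0.5, and average over questions. The function names and the lowercase/strip normalization are assumptions for illustration, not the official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity.

    predictions: list of predicted answer strings, one per question
    references:  list of lists of accepted answer strings, one list per question
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            denom = max(len(p), len(a))
            nl = levenshtein(p, a) / denom if denom else 0.0
            # Similarities below the cutoff count as a miss.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition an exact match (ignoring case) scores 1, a near miss such as an OCR slip earns partial credit, and an answer that differs in half or more of its characters scores 0.
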