Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
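The word-patch alignment objective mentioned in the abstract can be illustrated with a toy label-construction step. The sketch below is not the authors' implementation: the function name, the `word_to_patches` mapping, and the rule that a word counts as masked when any of its patches is masked are all assumptions made for illustration. It only shows how a binary target per word could be derived from a set of masked image patches.

```python
from typing import Dict, List, Set


def word_patch_alignment_labels(
    word_to_patches: Dict[int, List[int]],
    masked_patches: Set[int],
) -> Dict[int, int]:
    """Build a binary target per word (hypothetical helper).

    Returns 1 if any image patch covering the word is masked, else 0.
    The "any patch" rule is an assumption of this sketch.
    """
    return {
        word: int(any(patch in masked_patches for patch in patches))
        for word, patches in word_to_patches.items()
    }


# Toy page: three words, image split into patches 0..5.
word_to_patches = {0: [0], 1: [1, 2], 2: [5]}
masked_patches = {2, 3}  # patches hidden by image masking

labels = word_patch_alignment_labels(word_to_patches, masked_patches)
print(labels)
```

In a real pre-training setup these targets would feed a per-token binary classifier on top of the Transformer outputs; how masked text tokens are handled is left out of this sketch.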
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Entity Linking | EC-FUNSD | F1 | 78.14 | LayoutLMv3 (large) |
| Entity Linking | EC-FUNSD | F1 | 67.47 | LayoutLMv3 (base) |
| Named Entity Recognition (NER) | FUNSD-r | F1 | 78.77 | LayoutLMv3 |
| Named Entity Recognition (NER) | CORD-r | F1 | 82.72 | LayoutLMv3 |
| Document Layout Analysis | PubLayNet val | Figure | 0.97 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | List | 0.955 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Overall | 0.951 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Table | 0.979 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Text | 0.945 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Title | 0.906 | LayoutLMv3-B |
| Document AI | EPHOIE | Average F1 | 99.21 | LayoutLMv3 |
| Semantic entity labeling | EC-FUNSD | F1 | 83.88 | LayoutLMv3 (large) |
| Semantic entity labeling | EC-FUNSD | F1 | 82.3 | LayoutLMv3 (base) |
| Semantic entity labeling | FUNSD | F1 | 92.08 | LayoutLMv3 Large |
| Key Information Extraction | CORD | F1 | 97.46 | LayoutLMv3 Large |
| Key Information Extraction | EPHOIE | Average F1 | 99.21 | LayoutLMv3 |
| Key Information Extraction | RFUND-EN | key-value pair F1 | 57.66 | LayoutLMv3 |
| Key Information Extraction | SIBR | key-value pair F1 | 73.51 | LayoutLMv3_base_chinese |