Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

2020-12-29 · ACL 2021

Tasks: Document Layout Analysis · Relation Extraction · Document Understanding · Semantic Entity Labeling · Document Image Classification · Visual Question Answering (VQA) · Key Information Extraction · Key-value Pair Extraction · Language Modelling · Visual Question Answering

Paper · PDF · Code (official)

Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). The model and code are publicly available at \url{https://aka.ms/layoutlmv2}.
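The spatial-aware self-attention mechanism mentioned in the abstract adds learned relative-position biases (1D token order plus 2D bounding-box coordinates) to the attention scores before the softmax. The sketch below is a minimal single-head illustration of that idea, not the paper's exact implementation: the real model uses bucketed relative distances and per-head bias tables, whereas here the bias tables (`bias_1d`, `bias_x`, `bias_y`) are simple clipped-distance lookups introduced for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_aware_attention(q, k, v, pos_1d, box_x, box_y,
                            bias_1d, bias_x, bias_y, max_rel=4):
    """Single-head attention with additive relative-position biases.

    q, k, v           : (seq, d) query/key/value matrices
    pos_1d            : (seq,) integer token positions (reading order)
    box_x, box_y      : (seq,) integer bounding-box coordinates per token
    bias_1d/_x/_y     : (2*max_rel+1,) lookup tables (stand-ins for
                        learned parameters; hypothetical shapes)
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)           # (seq, seq) content-based scores

    def rel(p):
        # Clipped pairwise relative distance, shifted to a valid table index
        return np.clip(p[None, :] - p[:, None], -max_rel, max_rel) + max_rel

    scores = scores + bias_1d[rel(pos_1d)]  # 1D reading-order bias
    scores = scores + bias_x[rel(box_x)]    # 2D horizontal layout bias
    scores = scores + bias_y[rel(box_y)]    # 2D vertical layout bias
    return softmax(scores) @ v
```

With all-zero bias tables this reduces to vanilla scaled dot-product attention; nonzero tables let the model up- or down-weight attention between tokens based on how far apart they sit on the page, independent of content.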

Results

Task | Dataset | Metric | Value | Model
Relation Extraction | FUNSD | F1 | 70.57 | LayoutLMv2 LARGE
Visual Question Answering (VQA) | DocVQA test | ANLS | 0.8672 | LayoutLMv2 LARGE
Visual Question Answering (VQA) | DocVQA test | ANLS | 0.7808 | LayoutLMv2 BASE
Semantic Entity Labeling | FUNSD | F1 | 84.2 | LayoutLMv2 LARGE
Semantic Entity Labeling | FUNSD | F1 | 82.76 | LayoutLMv2 BASE
Key Information Extraction | CORD | F1 | 96.01 | LayoutLMv2 LARGE
Key Information Extraction | CORD | F1 | 94.95 | LayoutLMv2 BASE
Key Information Extraction | Kleister NDA | F1 | 85.2 | LayoutLMv2 LARGE
Key Information Extraction | Kleister NDA | F1 | 83.3 | LayoutLMv2 BASE
Key Information Extraction | SROIE | F1 | 97.81 | LayoutLMv2 LARGE (excluding OCR mismatch)
Key Information Extraction | SROIE | F1 | 96.61 | LayoutLMv2 LARGE
Key Information Extraction | SROIE | F1 | 96.25 | LayoutLMv2 BASE
Key Information Extraction | RFUND-EN | Key-value pair F1 | 49.06 | LayoutLMv2 BASE

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)