Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei

Published: 2022-04-18

Tasks: Document Layout Analysis, Question Answering, Relation Extraction, Image Classification, Representation Learning, Semantic entity labeling, Masked Language Modeling, Entity Linking, cross-modal alignment, Document AI, Document Image Classification, Visual Question Answering (VQA), Named Entity Recognition (NER), Key Information Extraction, Key-value Pair Extraction, Language Modelling, Visual Question Answering

Abstract

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
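
The word-patch alignment objective described in the abstract can be illustrated with a small label-construction sketch. This is not the authors' implementation: the function name `wpa_labels`, the one-patch-per-word mapping, the 1/0 label encoding, and the choice to skip words whose text is itself masked are all assumptions made for illustration.

```python
from typing import List, Optional, Set


def wpa_labels(
    word_patch_ids: List[int],
    masked_patches: Set[int],
    masked_words: Set[int],
) -> List[Optional[int]]:
    """Build per-word alignment targets (hypothetical helper, not the official code).

    word_patch_ids[i] is the index of the image patch that covers word i
    (simplifying assumption: each word maps to exactly one patch).
    Returns 1 ("aligned") if the word's patch is unmasked, 0 ("unaligned")
    if the patch is masked, and None for words whose text token is masked,
    which this sketch assumes are excluded from the alignment loss.
    """
    labels: List[Optional[int]] = []
    for word_idx, patch_id in enumerate(word_patch_ids):
        if word_idx in masked_words:
            labels.append(None)  # assumed: masked text tokens are skipped
        elif patch_id in masked_patches:
            labels.append(0)  # the word's image patch is masked
        else:
            labels.append(1)  # the word's image patch is visible
    return labels


if __name__ == "__main__":
    # Four words; words 2 and 3 share patch 2; patch 1 is masked; word 3 is masked.
    print(wpa_labels([0, 1, 2, 2], masked_patches={1}, masked_words={3}))
```

In a training setup these targets would typically feed a binary classification head over the text-token outputs, with the `None` positions left out of the loss.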

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Entity Linking | EC-FUNSD | F1 | 78.14 | LayoutLMv3 (large) |
| Entity Linking | EC-FUNSD | F1 | 67.47 | LayoutLMv3 (base) |
| Named Entity Recognition (NER) | FUNSD-r | F1 | 78.77 | LayoutLMv3 |
| Named Entity Recognition (NER) | CORD-r | F1 | 82.72 | LayoutLMv3 |
| Document Layout Analysis | PubLayNet val | Figure | 0.97 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | List | 0.955 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Overall | 0.951 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Table | 0.979 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Text | 0.945 | LayoutLMv3-B |
| Document Layout Analysis | PubLayNet val | Title | 0.906 | LayoutLMv3-B |
| Document AI | EPHOIE | Average F1 | 99.21 | LayoutLMv3 |
| Semantic entity labeling | EC-FUNSD | F1 | 83.88 | LayoutLMv3 (large) |
| Semantic entity labeling | EC-FUNSD | F1 | 82.3 | LayoutLMv3 (base) |
| Semantic entity labeling | FUNSD | F1 | 92.08 | LayoutLMv3 Large |
| Key Information Extraction | CORD | F1 | 97.46 | LayoutLMv3 Large |
| Key Information Extraction | EPHOIE | Average F1 | 99.21 | LayoutLMv3 |
| Key Information Extraction | RFUND-EN | key-value pair F1 | 57.66 | LayoutLMv3 |
| Key Information Extraction | SIBR | key-value pair F1 | 73.51 | LayoutLMv3_base_chinese |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)