Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding

Chong Zhang, Yi Tu, Yixi Zhao, Chenshu Yuan, Huan Chen, Yue Zhang, Mingxu Chai, Ya Guo, Huijia Zhu, Qi Zhang, Tao Gui

2024-09-29

Tasks: Relation Extraction · Document Understanding · Semantic Entity Labeling · Entity Linking · Reading Order Detection · Key Information Extraction

Paper · PDF · Code (official)

Abstract

Modeling and leveraging layout reading order in visually-rich documents (VrDs) is critical to document intelligence, as it captures the rich structural semantics within documents. Previous works typically formulated layout reading order as a permutation of layout elements, i.e., a sequence containing all the layout elements. However, we argue that this formulation does not adequately convey the complete reading order information in the layout, which may lead to performance declines in downstream VrD tasks. To address this issue, we propose modeling layout reading order as ordering relations over the set of layout elements, which have sufficient expressive capability for the complete reading order information. To enable empirical evaluation of methods for this improved form of reading order prediction (ROP), we establish a comprehensive benchmark dataset that annotates reading order as relations over layout elements, together with a relation-extraction-based method that outperforms previous approaches. Moreover, to highlight the practical benefits of the improved form of layout reading order, we propose a reading-order-relation-enhancing pipeline that improves model performance on any VrD task by introducing additional reading order relation inputs. Comprehensive results demonstrate that the pipeline generally benefits downstream VrD tasks: (1) by utilizing the reading order relation information, the enhanced downstream models achieve SOTA results on both task settings of the targeted dataset; (2) by utilizing pseudo reading order information generated by the proposed ROP model, the performance of the enhanced models improves across all three models and eight cross-domain VrD-IE/QA task settings without targeted optimization.
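The permutation-vs-relations distinction in the abstract can be sketched in a few lines. This is an illustrative toy (the names `permutation_to_relations` and the example layout are ours, not from the paper): a permutation always induces a single total order, while a set of ordering relations can also express a partial order, e.g. two columns whose cross-column reading order is deliberately left unconstrained.

```python
from itertools import combinations

def permutation_to_relations(order):
    """A permutation of element ids induces a total order:
    every earlier element is read before every later one."""
    return {(a, b) for a, b in combinations(order, 2)}

# A permutation over four layout elements forces one linear sequence...
total = permutation_to_relations([0, 1, 2, 3])  # 6 pairwise relations

# ...whereas relations can express a partial order, e.g. a two-column
# page where column (0, 1) and column (2, 3) are read independently.
partial = {(0, 1), (2, 3)}

# The partial order drops the cross-column constraints a permutation imposes.
assert partial < total
```

The point is expressiveness: every permutation maps to a relation set, but not every meaningful relation set (e.g. the two-column case above) is representable as a single permutation.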

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | FUNSD | F1 | 88.46 | RORE (GeoLayoutLM) |
| Entity Linking | EC-FUNSD | F1 | 87.42 | RORE (GeoLayoutLM) |
| Entity Linking | EC-FUNSD | F1 | 79.33 | RORE (LayoutLMv3-large) |
| Entity Linking | EC-FUNSD | F1 | 73.64 | RORE (LayoutLMv3-base) |
| Entity Linking | FUNSD | F1 | 88.46 | RORE (GeoLayoutLM) |
| Semantic Entity Labeling | EC-FUNSD | F1 | 84.53 | RORE (LayoutLMv3-large) |
| Semantic Entity Labeling | EC-FUNSD | F1 | 84.34 | RORE (GeoLayoutLM) |
| Semantic Entity Labeling | EC-FUNSD | F1 | 82.8 | RORE (LayoutLMv3-base) |
| Semantic Entity Labeling | FUNSD | F1 | 91.84 | RORE (GeoLayoutLM) |
| Key Information Extraction | CORD | F1 | 98.52 | RORE (GeoLayoutLM) |
| Key Information Extraction | SROIE | F1 | 96.97 | RORE (GeoLayoutLM) |
| Reading Order Detection | ROOR | Segment-level F1 | 82.38 | LayoutLMv3-GlobalPointer (large) |
| Reading Order Detection | ROOR | Segment-level F1 | 68.6 | LayoutLMv3-GlobalPointer (base) |
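All values in the table are F1 scores, the harmonic mean of precision and recall over predicted entities or relations. A minimal sketch of the computation (the function name and counts are illustrative, not tied to any specific dataset above):

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 90 correct predictions, 10 spurious, 10 missed -> F1 of 90.0
score = round(100 * f1_score(tp=90, fp=10, fn=10), 2)
```

Because F1 is a harmonic mean, it rewards models that balance precision and recall rather than maximizing one at the expense of the other.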

Related Papers

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends (2025-07-14)
DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations (2025-07-08)
PaddleOCR 3.0 Technical Report (2025-07-08)
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning (2025-07-01)
Class-Agnostic Region-of-Interest Matching in Document Images (2025-06-26)
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images (2025-06-26)
Multiple Streams of Relation Extraction: Enriching and Recalling in Transformers (2025-06-25)
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models (2025-06-25)