Unified Pretraining Framework for Document Understanding

Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Nikolaos Barmpalios, Rajiv Jain, Ani Nenkova, Tong Sun

2022-04-22

Document Layout Analysis, Document Understanding, Self-Supervised Learning

Abstract

Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions towards reducing annotation efforts by training models with self-supervised objectives. However, most of the existing document pretraining methods are still language-dominated. We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. Each input element is composed of words and visual features from a semantic region of the input document image. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses, encouraging the representation to model sentences, learn similarities, and align modalities. Extensive empirical analysis demonstrates that the pretraining procedure learns better joint representations and leads to improvements in downstream tasks.
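As a rough illustration of the input format described in the abstract, the sketch below builds one multimodal input element per semantic region and sums three loss terms. It is not the authors' implementation. The `Region` container, the toy character-hash text encoder, the use of plain concatenation to fuse text and visual parts, and the equal loss weights are all assumptions made for illustration. The abstract only states that each element combines words and visual features from a region and that three self-supervised losses are used.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Region:
    """One semantic region of a document image (hypothetical container)."""
    words: List[str]
    visual: List[float]  # stand-in for visual features extracted from the region


def toy_text_embedding(words: Sequence[str], dim: int = 4) -> List[float]:
    """Placeholder text encoder: folds character codes into a fixed-size vector.

    A real system would use a learned sentence encoder; this exists only so
    the example runs without external dependencies.
    """
    vec = [0.0] * dim
    for word in words:
        for i, ch in enumerate(word):
            vec[i % dim] += ord(ch) / 1000.0
    n = max(len(words), 1)
    return [v / n for v in vec]


def build_input_element(region: Region, dim: int = 4) -> List[float]:
    """Fuse text and visual parts into one input element.

    Assumption: simple concatenation. The abstract does not specify the
    fusion mechanism.
    """
    return toy_text_embedding(region.words, dim) + list(region.visual)


def total_pretraining_loss(
    l_sentence: float,
    l_similarity: float,
    l_alignment: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three self-supervised loss terms.

    Assumption: equal weights by default. The abstract names the goals
    (model sentences, learn similarities, align modalities) but not the weights.
    """
    return (
        weights[0] * l_sentence
        + weights[1] * l_similarity
        + weights[2] * l_alignment
    )


region = Region(words=["Total", "due"], visual=[0.1, 0.2])
element = build_input_element(region)
print(len(element))  # 4 text dimensions followed by 2 visual dimensions
print(total_pretraining_loss(1.0, 2.0, 3.0))
```

In this sketch a document becomes a sequence of such elements, one per region, which is what a Transformer encoder would consume in place of per-token embeddings.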

Results

Task	Dataset	Metric	Value	Model
Document Layout Analysis	PubLayNet val	Figure	0.964	UDoc
Document Layout Analysis	PubLayNet val	List	0.937	UDoc
Document Layout Analysis	PubLayNet val	Overall	0.939	UDoc
Document Layout Analysis	PubLayNet val	Table	0.973	UDoc
Document Layout Analysis	PubLayNet val	Text	0.939	UDoc
Document Layout Analysis	PubLayNet val	Title	0.885	UDoc
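
The short check below compares the "Overall" value with the unweighted mean of the five per-category values from the table. It assumes that "Overall" is a macro average over categories, which the table itself does not state. The variable names are illustrative.

```python
# Per-category values copied from the results table above (PubLayNet val, UDoc).
per_category = {
    "Figure": 0.964,
    "List": 0.937,
    "Table": 0.973,
    "Text": 0.939,
    "Title": 0.885,
}
reported_overall = 0.939

# Assumption: "Overall" is the unweighted mean over the five categories.
macro_average = sum(per_category.values()) / len(per_category)
gap = abs(macro_average - reported_overall)

print(f"macro average = {macro_average:.4f}")
print(f"reported overall = {reported_overall:.3f}")
print(f"gap = {gap:.4f}")
```

The mean of the rounded per-category values is about 0.9396, within 0.001 of the reported 0.939. The small gap is consistent with the per-category values having been rounded to three decimals before averaging.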
