Unifying Vision, Text, and Layout for Universal Document Processing

Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal

Published: 2022-12-05
Venue: CVPR 2023
Tasks: Document Understanding, Image Reconstruction, Document AI, Visual Question Answering (VQA)

Abstract

We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark.

Results

Task                             Dataset          Metric          Value   Model
Visual Question Answering (VQA)  DocVQA test      ANLS (0-1)      0.878   UDOP (aux)
Visual Question Answering (VQA)  DocVQA test      ANLS (0-1)      0.847   UDOP
Visual Question Answering (VQA)  InfographicVQA   ANLS (0-100)    63      UDOP (aux)
Visual Question Answering (VQA)  InfographicVQA   ANLS (0-100)    47.4    UDOP
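
The results above are reported in ANLS (Average Normalized Levenshtein Similarity), the usual metric for DocVQA-style benchmarks. The following is a minimal sketch of how such a score can be computed, assuming the common definition: for each question, take the best similarity over all accepted answers, zero out any similarity whose normalized edit distance reaches a threshold of 0.5, and average over questions. The function names `levenshtein` and `anls`, and the lowercasing and whitespace stripping, are illustrative assumptions and are not taken from the UDOP code or the official evaluation scripts.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity on a 0-1 scale.

    predictions: one predicted answer string per question.
    references:  one list of accepted answer strings per question.
    threshold:   normalized distances at or above this value score 0
                 (0.5 is assumed here as the commonly used setting).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0

    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            longest = max(len(p), len(g))
            # Two empty strings are treated as a perfect match.
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    preds = ["2019", "cat"]
    refs = [["2019"], ["dog"]]
    # One exact match (1.0) and one complete miss (0.0) average to 0.5.
    print(anls(preds, refs))
```

Multiplying the returned value by 100 gives the percentage-style scale that some leaderboards use for the same metric.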
