Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DUBLIN -- Document Understanding By Language-Image Network

Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Hardik Hansrajbhai Chauhan, Subhojit Som, Vishrav Chaudhary, Saurabh Tiwary

2023-05-23

Tasks: Reading Comprehension · Feature Engineering · Question Answering · Text Generation · Document Understanding · Document Classification · Visual Question Answering (VQA) · Key Information Extraction · Optical Character Recognition (OCR)

Abstract

Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives: the Masked Document Text Generation Task, the Bounding Box Task, and the Rendered Question Answering Task, which leverage both the spatial and semantic information in document images. Our model achieves competitive or state-of-the-art results on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SOTA models on the DocVQA, InfographicsVQA, OCR-VQA, and AI2D datasets by 4.6%, 6.5%, 2.6%, and 21%, respectively. We also achieve competitive performance on RVL-CDIP document classification. Moreover, we create new baselines for text-based datasets by rendering them as document images to promote research in this direction.

Results

Task                             | Dataset          | Metric | Value | Model
Visual Question Answering (VQA) | AI2D             | EM     | 51.11 | DUBLIN
Visual Question Answering (VQA) | DocVQA test      | ANLS   | 0.803 | DUBLIN (variable resolution)
Visual Question Answering (VQA) | DocVQA test      | ANLS   | 0.782 | DUBLIN
Visual Question Answering (VQA) | WebSRC           | EM     | 77.75 | DUBLIN
Visual Question Answering (VQA) | InfographicVQA   | ANLS   | 42.6  | DUBLIN (variable resolution)
Visual Question Answering (VQA) | InfographicVQA   | ANLS   | 36.82 | DUBLIN
Visual Question Answering (VQA) | DeepForm         | F1     | 62.23 | DUBLIN
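Several rows above report ANLS (Average Normalized Levenshtein Similarity), the standard metric for DocVQA-style benchmarks. A minimal sketch of how it is computed, assuming the usual DocVQA definition (per-question best match against all gold answers, scores below a 0.5 threshold zeroed out); the function names here are illustrative, not from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, threshold=0.5):
    """ANLS over a dataset.

    predictions:  list[str], one model answer per question
    gold_answers: list[list[str]], the accepted answers per question
    """
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.lower().strip(), gold.lower().strip()
            dist = levenshtein(p, g)
            # Normalized similarity in [0, 1]
            sim = 1 - dist / max(len(p), len(g), 1)
            best = max(best, sim)
        # Answers that are too dissimilar count as zero
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / len(scores)
```

For example, a prediction that matches a gold answer up to case scores 1.0, while a completely different string scores 0.0, so the 0.803 ANLS reported for DUBLIN (variable resolution) on DocVQA means its answers are, on average, near-exact matches.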

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)