End-to-end Document Recognition and Understanding with Dessurt

Brian Davis, Bryan Morse, Bryan Price, Chris Tensmeyer, Curtis Wigington, Vlad Morariu

2022-03-30 | Document Understanding | Visual Question Answering (VQA)

Abstract

We introduce Dessurt, a relatively simple document understanding transformer capable of being fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and task string as input and generates arbitrary text autoregressively as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to the document understanding, it does not require an external recognition model as prior methods do. Dessurt is a more flexible model than prior methods and is able to handle a variety of document domains and tasks. We show that this model is effective at 9 different dataset-task combinations.

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	DocVQA test	ANLS	0.632	Dessurt
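
The ANLS metric in the table above is Average Normalized Levenshtein Similarity. For each question, the prediction is compared with every accepted answer using edit distance normalized by the longer string's length; a similarity is counted only when the normalized distance is below a threshold (0.5 is the commonly used value), and the best per-question score is averaged over all questions. The following is a minimal sketch of that computation, not the official evaluation script. The function names `levenshtein` and `anls` are illustrative, and the lowercasing and whitespace stripping are assumptions about normalization.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    predictions[i] is the model's answer to question i; references[i] is
    the list of accepted answers. Normalization (strip + lowercase) is an
    assumption of this sketch.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            a = answer.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            # Scores at or beyond the threshold distance count as zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

For example, `anls(["abcd"], [["abce"]])` gives 0.75: one substitution over four characters is a normalized distance of 0.25, which is under the threshold. A prediction that shares no characters with a same-length answer has a normalized distance of 1.0 and scores 0.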
