Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka

2021-02-18
document understanding, Document Image Classification, Visual Question Answering (VQA)

Abstract

We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
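To make the phrase "the layout is represented as an attention bias" concrete, here is a minimal sketch of attention scores with an additive layout term. It is not the authors' implementation. The function names, the uniform bucketing of pixel offsets, the split into separate horizontal and vertical bias tables, and the use of bounding-box centers are all assumptions made for illustration.

```python
import math


def bucket(delta, num_buckets=8, max_dist=1000):
    """Map a signed pixel offset to a bucket index (hypothetical uniform bucketing)."""
    clipped = max(-max_dist, min(max_dist, delta))
    # Scale [-max_dist, max_dist] onto the integer range [0, num_buckets - 1].
    return min(num_buckets - 1, int((clipped + max_dist) * num_buckets / (2 * max_dist)))


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, centers, bias_h, bias_v):
    """Attention weights where each score gets an additive layout bias.

    queries, keys: one vector (list of floats) per token.
    centers: (x, y) center of each token's bounding box (assumed input).
    bias_h, bias_v: learned scalars per bucket; plain lists in this sketch.
    """
    n = len(queries)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            # Content term: dot product between query i and key j.
            score = sum(q * k for q, k in zip(queries[i], keys[j]))
            # Layout term: biases looked up from horizontal and vertical offsets.
            dx = centers[j][0] - centers[i][0]
            dy = centers[j][1] - centers[i][1]
            row.append(score + bias_h[bucket(dx)] + bias_v[bucket(dy)])
        weights.append(softmax(row))
    return weights
```

Because the bias is added to the scores before the softmax, two tokens with identical text content can still attend to each other differently depending on where they sit on the page.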

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	DocVQA test	ANLS	0.8705	TILT-Large
Visual Question Answering (VQA)	DocVQA test	ANLS	0.8392	TILT-Base
Visual Question Answering (VQA)	InfographicVQA	ANLS (%)	61.2	TILT-Large
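The ANLS metric in the table (Average Normalized Levenshtein Similarity) can be sketched as follows. This is a minimal illustration, not the official evaluation script: the 0.5 threshold and the lowercase-and-strip normalization follow the commonly used definition and are assumptions here, as are the function names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings using a rolling dynamic-programming row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(ground_truths, predictions, threshold=0.5):
    """Average over questions of the best thresholded similarity.

    ground_truths: one list of accepted answer strings per question.
    predictions: one predicted string per question.
    threshold: normalized distances at or above this value score 0 (assumed 0.5).
    """
    scores = []
    for answers, pred in zip(ground_truths, predictions):
        pred_norm = pred.strip().lower()
        best = 0.0
        for ans in answers:
            ans_norm = ans.strip().lower()
            longest = max(len(ans_norm), len(pred_norm))
            # Normalized Levenshtein distance in [0, 1]; two empty strings match.
            nl = levenshtein(ans_norm, pred_norm) / longest if longest else 0.0
            if nl < threshold:
                best = max(best, 1.0 - nl)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The score lies in [0, 1], so a value such as 0.8705 is on that scale, while a value such as 61.2 is the same quantity multiplied by 100.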

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends (2025-07-14)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)