Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks such as pose estimation, object detection, depth estimation, and image generation; vision-and-language tasks such as region captioning and referring expression comprehension; and natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks such as NYUv2-Depth, ImageNet, VQA 2.0, OK-VQA, SWiG, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.
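The key idea above — homogenizing heterogeneous outputs into discrete vocabulary tokens — can be illustrated with bounding boxes: continuous coordinates are quantized into a fixed set of location tokens that the transformer can emit like ordinary words. The sketch below is not the authors' code; the bin count (1000) and the `<loc_*>` token naming are assumptions made for illustration.

```python
# Illustrative sketch: mapping a continuous bounding box to discrete
# vocabulary tokens and back. NUM_BINS and the token format are
# hypothetical choices, not taken from the Unified-IO release.

NUM_BINS = 1000  # number of discrete location tokens added to the vocabulary


def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize an (x1, y1, x2, y2) box into discrete location tokens."""
    x1, y1, x2, y2 = box
    # Normalize each coordinate to [0, 1] relative to the image size.
    coords = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    # Clamp and map each normalized coordinate to an integer bin index.
    bins = [min(num_bins - 1, max(0, int(c * num_bins))) for c in coords]
    return [f"<loc_{b}>" for b in bins]


def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the mapping, recovering bin-center pixel coordinates."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    centers = [(b + 0.5) / num_bins for b in bins]
    return (centers[0] * img_w, centers[1] * img_h,
            centers[2] * img_w, centers[3] * img_h)


tokens = box_to_tokens((48, 32, 320, 240), img_w=640, img_h=480)
recovered = tokens_to_box(tokens, img_w=640, img_h=480)
```

With box outputs expressed this way, detection and localization reduce to ordinary sequence generation, so one decoder and one cross-entropy loss can serve tasks with otherwise incompatible output spaces.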
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | GRIT | VQA (ablation) | 74.5 | Unified-IO XL |
| Visual Question Answering (VQA) | GRIT | VQA (test) | 74.5 | Unified-IO XL |
| Object Localization | GRIT | Localization (ablation) | 67.0 | Unified-IO XL |
| Object Localization | GRIT | Localization (test) | 67.1 | Unified-IO XL |
| Object Segmentation | GRIT | Segmentation (ablation) | 56.3 | Unified-IO XL |
| Object Segmentation | GRIT | Segmentation (test) | 56.5 | Unified-IO XL |
| Object Categorization | GRIT | Categorization (ablation) | 61.7 | Unified-IO XL |
| Object Categorization | GRIT | Categorization (test) | 60.8 | Unified-IO XL |