Generalized Decoding for Pixel, Image, and Language

Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, JianFeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao

2022-12-21 | CVPR 2023 | Panoptic Segmentation, Zero Shot Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Instance Segmentation, Image Segmentation


Abstract

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
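
The idea of decoding pixel-level and token-level outputs from the same queries can be illustrated with a toy sketch. This is not the X-Decoder implementation: the real model uses a learned transformer decoder, whereas the snippet below assumes the query vectors are already decoded and only shows one plausible reading of "the same semantic space", namely that a single query vector is compared against pixel embeddings to get a mask and against text embeddings to get a label. All function names, the threshold, and the 2-D toy embeddings are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def decode_masks(queries, pixel_features, threshold=0.0):
    """Pixel-level output (assumed form): one binary mask per query,
    obtained by thresholding the query/pixel-embedding dot product."""
    return [[1 if dot(q, p) > threshold else 0 for p in pixel_features]
            for q in queries]

def decode_semantics(queries, text_embeddings):
    """Semantic output (assumed form): for each query, the index of the
    text embedding with the highest cosine similarity."""
    labels = []
    for q in queries:
        qn = normalize(q)
        sims = [dot(qn, normalize(t)) for t in text_embeddings]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels

# Toy data: four "pixels" and two concepts in a shared 2-D embedding space.
pixel_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-0.1, 0.8]]
latent_queries = [[1.0, -1.0], [-1.0, 1.0]]   # stand-ins for decoded generic queries
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]    # stand-ins for two encoded concept names

masks = decode_masks(latent_queries, pixel_features)
labels = decode_semantics(latent_queries, text_embeddings)
print(masks)   # [[1, 1, 0, 0], [0, 0, 1, 1]]
print(labels)  # [0, 1]
```

Because the mask and the label both come from the same query vector, swapping the list of text embeddings changes the vocabulary without touching the mask computation, which is the property an open-vocabulary setup relies on.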

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K val	AP	38.7	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Semantic Segmentation	ADE20K val	PQ	52.4	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Semantic Segmentation	ADE20K val	mIoU	59.1	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Semantic Segmentation	ADE20K val	AP	35.8	X-Decoder (L)
Semantic Segmentation	ADE20K val	PQ	49.6	X-Decoder (L)
Semantic Segmentation	ADE20K val	mIoU	58.1	X-Decoder (L)
Instance Segmentation	ADE20K val	AP	38.7	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Instance Segmentation	ADE20K val	APL	59.6	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Instance Segmentation	ADE20K val	APM	43.3	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Instance Segmentation	ADE20K val	APS	18.9	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Instance Segmentation	ADE20K val	AP	35.8	X-Decoder (L)
Instance Segmentation	RefCOCOg-val	Overall IoU	64.6	X-Decoder (Davit-d5)
Zero Shot Segmentation	Segmentation in the Wild	Mean AP	32.2	SGinW_Team (X-Decoder-L)
Zero Shot Segmentation	Segmentation in the Wild	Mean AP	27.7	SGinW_Team (X-Decoder-B)
Zero Shot Segmentation	Segmentation in the Wild	Mean AP	26.6	SGinW_Team (X-Decoder-L-IN21K)
Zero Shot Segmentation	Segmentation in the Wild	Mean AP	22.6	SGinW_Team (X-Decoder-T)
Referring Expression Segmentation	RefCOCOg-val	Overall IoU	64.6	X-Decoder (Davit-d5)
10-shot image generation	ADE20K val	AP	38.7	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
10-shot image generation	ADE20K val	PQ	52.4	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
10-shot image generation	ADE20K val	mIoU	59.1	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
10-shot image generation	ADE20K val	AP	35.8	X-Decoder (L)
10-shot image generation	ADE20K val	PQ	49.6	X-Decoder (L)
10-shot image generation	ADE20K val	mIoU	58.1	X-Decoder (L)
Panoptic Segmentation	ADE20K val	AP	38.7	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Panoptic Segmentation	ADE20K val	PQ	52.4	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Panoptic Segmentation	ADE20K val	mIoU	59.1	X-Decoder (Davit-d5, Deform, single-scale, 1280x1280)
Panoptic Segmentation	ADE20K val	AP	35.8	X-Decoder (L)
Panoptic Segmentation	ADE20K val	PQ	49.6	X-Decoder (L)
Panoptic Segmentation	ADE20K val	mIoU	58.1	X-Decoder (L)
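
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how mIoU and PQ are commonly defined. It is not the evaluation code behind the numbers above: benchmark tooling typically accumulates intersections and unions over the whole dataset and reports percentages, and the function names and toy inputs here are hypothetical. PQ follows the standard definition, where predicted and ground-truth segments are matched when their IoU exceeds 0.5, and the sum of matched IoUs is divided by |TP| + 0.5|FP| + 0.5|FN|.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes that appear in
    either the prediction or the ground truth (flat label lists)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ from the IoUs of matched (true-positive) segment pairs plus
    counts of unmatched predictions (FP) and unmatched ground truth (FN)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0

# Toy example: six pixels, three classes.
pred   = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
print(mean_iou(pred, target, num_classes=3))                   # 0.5

# Two matched segments, one spurious prediction, one missed segment.
print(panoptic_quality([0.75, 0.75], num_fp=1, num_fn=1))      # 0.5
```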
