GRiT: A Generative Region-to-text Transformer for Object Understanding

Jialian Wu, JianFeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang

2022-12-01 · Descriptive · object-detection · Dense Captioning · Object Detection

Abstract

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
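The <region, text> formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `RegionText`, `understand_objects`, and the `toy_*` helpers are hypothetical, the (x1, y1, x2, y2) box convention is an assumption, and the three stages are passed in as plain callables standing in for the visual encoder, foreground object extractor, and text decoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class RegionText:
    """One <region, text> pair: where the object is and how it is described."""
    box: Box
    text: str


def understand_objects(
    image: object,
    encode: Callable[[object], object],
    extract_regions: Callable[[object], Sequence[Box]],
    decode_text: Callable[[object, Box], str],
) -> List[RegionText]:
    """Chain the three stages named in the abstract (all callables are placeholders)."""
    features = encode(image)                      # visual encoder
    boxes = extract_regions(features)             # foreground object extractor
    return [RegionText(box, decode_text(features, box)) for box in boxes]  # text decoder


# Toy stand-ins so the sketch runs without a trained model.
def toy_encode(image):
    return image

def toy_regions(features):
    return [(10.0, 20.0, 110.0, 220.0)]

def toy_class_name(features, box):
    return "dog"

def toy_caption(features, box):
    return "a brown dog lying on the grass"


# Same output structure for both tasks; only the generated text differs.
detections = understand_objects("img", toy_encode, toy_regions, toy_class_name)
captions = understand_objects("img", toy_encode, toy_regions, toy_caption)
```

Swapping the decoder callable is only a device of this sketch. The abstract states that one model architecture covers both tasks; the point here is that a class name and a descriptive sentence fit the same <region, text> output type.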

Results

Task	Dataset	Metric	Value	Model
Object Detection	COCO test-dev	box mAP	60.4	GRiT (ViT-H, single-scale testing)
Object Detection	COCO-O	Average mAP	42.9	GRiT (ViT-H)
Object Detection	COCO-O	Effective Robustness	15.72	GRiT (ViT-H)
Dense Captioning	Visual Genome	mAP	15.5	GRiT (ViT-B)
