General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai

2023-12-14, CVPR 2024

Tasks: Zero-shot Generalization, Long-tail Video Object Segmentation, Referring Expression Comprehension, Referring Video Object Segmentation, Multi-Object Tracking, Referring Expression Segmentation, Video Object Segmentation, Instance Segmentation, Video Instance Segmentation, Object Detection, Open-World Instance Segmentation


Abstract

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.

Results

Task	Dataset	Metric	Value	Model
Multi-Object Tracking	TAO	AssocA	46.2	GLEE-Pro
Multi-Object Tracking	TAO	ClsA	29.1	GLEE-Pro
Multi-Object Tracking	TAO	LocA	66.2	GLEE-Pro
Multi-Object Tracking	TAO	TETA	47.2	GLEE-Pro
Multi-Object Tracking	TAO	AssocA	40.9	GLEE-Plus
Multi-Object Tracking	TAO	ClsA	30.8	GLEE-Plus
Multi-Object Tracking	TAO	LocA	52.9	GLEE-Plus
Multi-Object Tracking	TAO	TETA	41.5	GLEE-Plus
Multi-Object Tracking	TAO	AssocA	39.9	GLEE-Lite
Multi-Object Tracking	TAO	ClsA	24.1	GLEE-Lite
Multi-Object Tracking	TAO	LocA	56.3	GLEE-Lite
Multi-Object Tracking	TAO	TETA	40.1	GLEE-Lite
Object Detection	COCO test-dev	box mAP	62.3	GLEE-Pro
Object Detection	COCO test-dev	box mAP	60.6	GLEE-Plus
Object Detection	COCO test-dev	box mAP	54.7	GLEE-Lite
Object Detection	COCO minival	box AP	62	GLEE-Pro
Object Detection	COCO minival	box AP	60.4	GLEE-Plus
Object Detection	COCO minival	box AP	55	GLEE-Lite
Object Detection	LVIS v1.0 val	box AP	55.7	GLEE-Pro
Instance Segmentation	COCO minival	mask AP	54.2	GLEE-Pro
Instance Segmentation	COCO minival	mask AP	53	GLEE-Plus
Instance Segmentation	COCO minival	mask AP	48.4	GLEE-Lite
Instance Segmentation	COCO test-dev	mask AP	54.5	GLEE-Pro
Instance Segmentation	COCO test-dev	mask AP	53.3	GLEE-Plus
Instance Segmentation	COCO test-dev	mask AP	48.3	GLEE-Lite
Instance Segmentation	LVIS v1.0 val	mask AP	49.9	GLEE-Pro
Instance Segmentation	UVO	ARmask	72.6	GLEE-Pro
Video Object Segmentation	Refer-YouTube-VOS	F	72.9	GLEE-Pro
Video Object Segmentation	Refer-YouTube-VOS	J	68.2	GLEE-Pro
Video Object Segmentation	Refer-YouTube-VOS	J&F	70.6	GLEE-Pro
Video Object Segmentation	Refer-YouTube-VOS	F	69.7	GLEE-Plus
Video Object Segmentation	Refer-YouTube-VOS	J	65.6	GLEE-Plus
Video Object Segmentation	Refer-YouTube-VOS	J&F	67.7	GLEE-Plus
Video Object Segmentation	BURST-val	HOTA (all)	31.2	GLEE-Pro
Video Object Segmentation	BURST-val	HOTA (com)	48.7	GLEE-Pro
Video Object Segmentation	BURST-val	HOTA (unc)	26.9	GLEE-Pro
Video Object Segmentation	BURST-val	mAP (all)	19.2	GLEE-Pro
Video Object Segmentation	BURST-val	mAP (com)	24.8	GLEE-Pro
Video Object Segmentation	BURST-val	mAP (unc)	17.7	GLEE-Pro
Video Object Segmentation	BURST-val	HOTA (all)	26.9	GLEE-Plus
Video Object Segmentation	BURST-val	HOTA (com)	38.8	GLEE-Plus
Video Object Segmentation	BURST-val	HOTA (unc)	23.9	GLEE-Plus
Video Object Segmentation	BURST-val	mAP (all)	17.2	GLEE-Plus
Video Object Segmentation	BURST-val	mAP (com)	23.7	GLEE-Plus
Video Object Segmentation	BURST-val	mAP (unc)	15.5	GLEE-Plus
Video Object Segmentation	BURST-val	HOTA (all)	22.6	GLEE-Lite
Video Object Segmentation	BURST-val	HOTA (com)	36.4	GLEE-Lite
Video Object Segmentation	BURST-val	HOTA (unc)	19.1	GLEE-Lite
Video Object Segmentation	BURST-val	mAP (all)	12.6	GLEE-Lite
Video Object Segmentation	BURST-val	mAP (com)	18.9	GLEE-Lite
Video Object Segmentation	BURST-val	mAP (unc)	11	GLEE-Lite
Referring Expression Segmentation	RefCOCO	IoU	80	GLEE-Pro
Referring Expression Segmentation	RefCOCO val	Overall IoU	80	GLEE-Pro
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	F	72.9	GLEE-Pro
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J	68.2	GLEE-Pro
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J&F	70.6	GLEE-Pro
Referring Expression Segmentation	RefCOCO+ val	Overall IoU	69.6	GLEE-Pro
Referring Expression Segmentation	RefCOCOg val	Overall IoU	72.9	GLEE-Pro
Video Instance Segmentation	OVIS validation	AP75	55.5	GLEE-Pro
Video Instance Segmentation	OVIS validation	mask AP	50.4	GLEE-Pro
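
Two of the composite metrics in the table can be sanity-checked against their components. The sketch below assumes the standard definitions (J&F as the mean of J and F, and TETA as the mean of LocA, AssocA, and ClsA); it is not code from the GLEE release, and the helper names `mean` and `max_gap` are made up for illustration. Because the table reports values rounded to one decimal, the recomputed means are only expected to agree with the reported composites to within about 0.1.

```python
# Reported GLEE numbers copied from the results table.
# The averaging relations are the standard metric definitions,
# not code taken from the GLEE authors.

def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

# Refer-YouTube-VOS: (J, F, reported J&F)
ref_ytvos = {
    "GLEE-Pro": (68.2, 72.9, 70.6),
    "GLEE-Plus": (65.6, 69.7, 67.7),
}

# TAO: (LocA, AssocA, ClsA, reported TETA)
tao = {
    "GLEE-Pro": (66.2, 46.2, 29.1, 47.2),
    "GLEE-Plus": (52.9, 40.9, 30.8, 41.5),
    "GLEE-Lite": (56.3, 39.9, 24.1, 40.1),
}

def max_gap(table):
    """Largest |mean(components) - reported composite| over all models."""
    return max(abs(mean(row[:-1]) - row[-1]) for row in table.values())

if __name__ == "__main__":
    for name, table in (("J&F", ref_ytvos), ("TETA", tao)):
        for model, row in table.items():
            print(f"{name} {model}: recomputed {mean(row[:-1]):.2f}, reported {row[-1]}")
        print(f"{name} largest gap: {max_gap(table):.3f}")
```

A gap larger than the rounding tolerance would point to a transcription error in one of the rows rather than to a property of the model.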
