Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
In this work, we present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in open-world scenarios for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, enabling the model to solve various object-centric downstream tasks simultaneously while maintaining state-of-the-art performance. Through extensive training on over five million images from diverse benchmarks, GLEE demonstrates remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundational model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Multi-Object Tracking | TAO | AssocA | 46.2 | GLEE-Pro |
| Multi-Object Tracking | TAO | ClsA | 29.1 | GLEE-Pro |
| Multi-Object Tracking | TAO | LocA | 66.2 | GLEE-Pro |
| Multi-Object Tracking | TAO | TETA | 47.2 | GLEE-Pro |
| Multi-Object Tracking | TAO | AssocA | 40.9 | GLEE-Plus |
| Multi-Object Tracking | TAO | ClsA | 30.8 | GLEE-Plus |
| Multi-Object Tracking | TAO | LocA | 52.9 | GLEE-Plus |
| Multi-Object Tracking | TAO | TETA | 41.5 | GLEE-Plus |
| Multi-Object Tracking | TAO | AssocA | 39.9 | GLEE-Lite |
| Multi-Object Tracking | TAO | ClsA | 24.1 | GLEE-Lite |
| Multi-Object Tracking | TAO | LocA | 56.3 | GLEE-Lite |
| Multi-Object Tracking | TAO | TETA | 40.1 | GLEE-Lite |
| Object Detection | COCO test-dev | box mAP | 62.3 | GLEE-Pro |
| Object Detection | COCO test-dev | box mAP | 60.6 | GLEE-Plus |
| Object Detection | COCO test-dev | box mAP | 54.7 | GLEE-Lite |
| Object Detection | COCO minival | box AP | 62 | GLEE-Pro |
| Object Detection | COCO minival | box AP | 60.4 | GLEE-Plus |
| Object Detection | COCO minival | box AP | 55 | GLEE-Lite |
| Object Detection | LVIS v1.0 val | box AP | 55.7 | GLEE-Pro |
| Instance Segmentation | COCO minival | mask AP | 54.2 | GLEE-Pro |
| Instance Segmentation | COCO minival | mask AP | 53 | GLEE-Plus |
| Instance Segmentation | COCO minival | mask AP | 48.4 | GLEE-Lite |
| Instance Segmentation | COCO test-dev | mask AP | 54.5 | GLEE-Pro |
| Instance Segmentation | COCO test-dev | mask AP | 53.3 | GLEE-Plus |
| Instance Segmentation | COCO test-dev | mask AP | 48.3 | GLEE-Lite |
| Instance Segmentation | LVIS v1.0 val | mask AP | 49.9 | GLEE-Pro |
| Instance Segmentation | RefCOCO | IoU | 80 | GLEE-Pro |
| Instance Segmentation | RefCOCO val | Overall IoU | 80 | GLEE-Pro |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 72.9 | GLEE-Pro |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 68.2 | GLEE-Pro |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 70.6 | GLEE-Pro |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 69.6 | GLEE-Pro |
| Instance Segmentation | RefCOCOg val | Overall IoU | 72.9 | GLEE-Pro |
| Instance Segmentation | UVO | AR (mask) | 72.6 | GLEE-Pro |
| Video Object Segmentation | Refer-YouTube-VOS | F | 72.9 | GLEE-Pro |
| Video Object Segmentation | Refer-YouTube-VOS | J | 68.2 | GLEE-Pro |
| Video Object Segmentation | Refer-YouTube-VOS | J&F | 70.6 | GLEE-Pro |
| Video Object Segmentation | Refer-YouTube-VOS | F | 69.7 | GLEE-Plus |
| Video Object Segmentation | Refer-YouTube-VOS | J | 65.6 | GLEE-Plus |
| Video Object Segmentation | Refer-YouTube-VOS | J&F | 67.7 | GLEE-Plus |
| Video Object Segmentation | BURST-val | HOTA (all) | 31.2 | GLEE-Pro |
| Video Object Segmentation | BURST-val | HOTA (com) | 48.7 | GLEE-Pro |
| Video Object Segmentation | BURST-val | HOTA (unc) | 26.9 | GLEE-Pro |
| Video Object Segmentation | BURST-val | mAP (all) | 19.2 | GLEE-Pro |
| Video Object Segmentation | BURST-val | mAP (com) | 24.8 | GLEE-Pro |
| Video Object Segmentation | BURST-val | mAP (unc) | 17.7 | GLEE-Pro |
| Video Object Segmentation | BURST-val | HOTA (all) | 26.9 | GLEE-Plus |
| Video Object Segmentation | BURST-val | HOTA (com) | 38.8 | GLEE-Plus |
| Video Object Segmentation | BURST-val | HOTA (unc) | 23.9 | GLEE-Plus |
| Video Object Segmentation | BURST-val | mAP (all) | 17.2 | GLEE-Plus |
| Video Object Segmentation | BURST-val | mAP (com) | 23.7 | GLEE-Plus |
| Video Object Segmentation | BURST-val | mAP (unc) | 15.5 | GLEE-Plus |
| Video Object Segmentation | BURST-val | HOTA (all) | 22.6 | GLEE-Lite |
| Video Object Segmentation | BURST-val | HOTA (com) | 36.4 | GLEE-Lite |
| Video Object Segmentation | BURST-val | HOTA (unc) | 19.1 | GLEE-Lite |
| Video Object Segmentation | BURST-val | mAP (all) | 12.6 | GLEE-Lite |
| Video Object Segmentation | BURST-val | mAP (com) | 18.9 | GLEE-Lite |
| Video Object Segmentation | BURST-val | mAP (unc) | 11 | GLEE-Lite |
| Video Object Segmentation | BURST | HOTA (all) | 22.6 | GLEE-Lite |
| Video Object Segmentation | BURST | HOTA (com) | 36.4 | GLEE-Lite |
| Video Object Segmentation | BURST | HOTA (unc) | 19.1 | GLEE-Lite |
| Video Object Segmentation | BURST | mAP (all) | 12.6 | GLEE-Lite |
| Video Object Segmentation | BURST | mAP (com) | 18.9 | GLEE-Lite |
| Video Object Segmentation | BURST | mAP (unc) | 11 | GLEE-Lite |
| Referring Expression Segmentation | RefCOCO | IoU | 80 | GLEE-Pro |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 80 | GLEE-Pro |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 72.9 | GLEE-Pro |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 68.2 | GLEE-Pro |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 70.6 | GLEE-Pro |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 69.6 | GLEE-Pro |
| Referring Expression Segmentation | RefCOCOg val | Overall IoU | 72.9 | GLEE-Pro |
| Video Instance Segmentation | OVIS validation | AP75 | 55.5 | GLEE-Pro |
| Video Instance Segmentation | OVIS validation | mask AP | 50.4 | GLEE-Pro |
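
Two of the metrics in the table are composites of other rows: J&F is conventionally the mean of region similarity J and contour accuracy F, and TETA is conventionally the mean of LocA, AssocA, and ClsA. The sketch below is a minimal consistency check under those assumptions, not the official evaluation code. It recomputes the composites from the reported components; the dictionary layout and function names are illustrative only, and the scores are copied from the table.

```python
"""Check that composite scores in the table agree with their components."""

# Scores copied from the results table; the layout is an illustrative choice.
REFER_YTVOS = {
    "GLEE-Pro": {"J": 68.2, "F": 72.9, "J&F": 70.6},
    "GLEE-Plus": {"J": 65.6, "F": 69.7, "J&F": 67.7},
}
TAO = {
    "GLEE-Pro": {"LocA": 66.2, "AssocA": 46.2, "ClsA": 29.1, "TETA": 47.2},
    "GLEE-Plus": {"LocA": 52.9, "AssocA": 40.9, "ClsA": 30.8, "TETA": 41.5},
    "GLEE-Lite": {"LocA": 56.3, "AssocA": 39.9, "ClsA": 24.1, "TETA": 40.1},
}


def j_and_f(j, f):
    """Assumed definition: mean of region similarity J and contour accuracy F."""
    return (j + f) / 2


def teta(loc_a, assoc_a, cls_a):
    """Assumed definition: mean of localization, association, classification accuracy."""
    return (loc_a + assoc_a + cls_a) / 3


def consistency_gaps():
    """Return |recomputed - reported| for every composite score listed above."""
    gaps = {}
    for model, s in REFER_YTVOS.items():
        gaps[(model, "J&F")] = abs(j_and_f(s["J"], s["F"]) - s["J&F"])
    for model, s in TAO.items():
        recomputed = teta(s["LocA"], s["AssocA"], s["ClsA"])
        gaps[(model, "TETA")] = abs(recomputed - s["TETA"])
    return gaps


if __name__ == "__main__":
    for (model, metric), gap in consistency_gaps().items():
        print(f"{model:10s} {metric:5s} gap = {gap:.2f}")
```

Because every table entry is rounded to one decimal place, a gap below 0.1 is what agreement looks like here; a larger gap would suggest a transcription error in one of the rows.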