Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai

Published 2023-12-14 · CVPR 2024

Tasks: Zero-shot Generalization, Long-tail Video Object Segmentation, Referring Expression Comprehension, Referring Video Object Segmentation, Multi-Object Tracking, Referring Expression Segmentation, Video Object Segmentation, Instance Segmentation, Video Instance Segmentation, Object Detection, Open-World Instance Segmentation

Paper · PDF · Code (official)

Abstract

We present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open-world scenario across a range of object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Trained extensively on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization, efficiently tackling downstream tasks without task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE can be integrated into Large Language Models, serving as a foundational model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .
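The abstract's three-input design (image encoder, text encoder, and visual prompter feeding one shared object decoder) can be sketched in a few lines. This is a hypothetical illustration, not GLEE's actual code: the embedding width, function names, and random projections below are all invented for the example; only the interface shape (any prompt type maps into a shared embedding space that re-scores object queries) reflects the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # shared embedding width (illustrative, not GLEE's real size)

# Stand-ins for the three encoders named in the abstract.
def image_encoder(image):           # (H, W, 3) -> (N_tokens, D)
    patches = image.reshape(-1, 3)[:64]
    return patches @ rng.standard_normal((3, D))

def text_encoder(prompt):           # category name or referring expression -> (1, D)
    seed = abs(hash(prompt)) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal((1, D))

def visual_prompter(box):           # box/point prompt -> (1, D)
    return np.asarray(box, dtype=float).reshape(1, 4) @ rng.standard_normal((4, D))

def object_decoder(img_tokens, prompt_emb, n_queries=10):
    # Object queries pool image tokens; the prompt embedding re-scores them,
    # so one decoder serves detection, grounding, and segmentation alike.
    queries = rng.standard_normal((n_queries, D))
    attn = queries @ img_tokens.T               # (n_queries, N_tokens)
    obj_emb = attn @ img_tokens                 # pooled object representations
    scores = (obj_emb @ prompt_emb.T).ravel()   # similarity to the prompt
    return obj_emb, scores

image = rng.random((32, 32, 3))
tokens = image_encoder(image)
emb, scores = object_decoder(tokens, text_encoder("a red car"))
```

The point is the shared interface: swapping `text_encoder(...)` for `visual_prompter(...)` changes the task (text grounding vs. interactive segmentation) without changing the decoder, which is what lets one model cover the task list above.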

Results

Task | Dataset | Metric | Value | Model
Multi-Object Tracking | TAO | AssocA | 46.2 | GLEE-Pro
Multi-Object Tracking | TAO | ClsA | 29.1 | GLEE-Pro
Multi-Object Tracking | TAO | LocA | 66.2 | GLEE-Pro
Multi-Object Tracking | TAO | TETA | 47.2 | GLEE-Pro
Multi-Object Tracking | TAO | AssocA | 40.9 | GLEE-Plus
Multi-Object Tracking | TAO | ClsA | 30.8 | GLEE-Plus
Multi-Object Tracking | TAO | LocA | 52.9 | GLEE-Plus
Multi-Object Tracking | TAO | TETA | 41.5 | GLEE-Plus
Multi-Object Tracking | TAO | AssocA | 39.9 | GLEE-Lite
Multi-Object Tracking | TAO | ClsA | 24.1 | GLEE-Lite
Multi-Object Tracking | TAO | LocA | 56.3 | GLEE-Lite
Multi-Object Tracking | TAO | TETA | 40.1 | GLEE-Lite
Object Detection | COCO test-dev | box mAP | 62.3 | GLEE-Pro
Object Detection | COCO test-dev | box mAP | 60.6 | GLEE-Plus
Object Detection | COCO test-dev | box mAP | 54.7 | GLEE-Lite
Object Detection | COCO minival | box AP | 62.0 | GLEE-Pro
Object Detection | COCO minival | box AP | 60.4 | GLEE-Plus
Object Detection | COCO minival | box AP | 55.0 | GLEE-Lite
Object Detection | LVIS v1.0 val | box AP | 55.7 | GLEE-Pro
Instance Segmentation | COCO minival | mask AP | 54.2 | GLEE-Pro
Instance Segmentation | COCO minival | mask AP | 53.0 | GLEE-Plus
Instance Segmentation | COCO minival | mask AP | 48.4 | GLEE-Lite
Instance Segmentation | COCO test-dev | mask AP | 54.5 | GLEE-Pro
Instance Segmentation | COCO test-dev | mask AP | 53.3 | GLEE-Plus
Instance Segmentation | COCO test-dev | mask AP | 48.3 | GLEE-Lite
Instance Segmentation | LVIS v1.0 val | mask AP | 49.9 | GLEE-Pro
Instance Segmentation | RefCOCO | IoU | 80.0 | GLEE-Pro
Instance Segmentation | RefCoCo val | Overall IoU | 80.0 | GLEE-Pro
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 72.9 | GLEE-Pro
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 68.2 | GLEE-Pro
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 70.6 | GLEE-Pro
Instance Segmentation | RefCOCO+ val | Overall IoU | 69.6 | GLEE-Pro
Instance Segmentation | RefCOCOg-val | Overall IoU | 72.9 | GLEE-Pro
Instance Segmentation | UVO | ARmask | 72.6 | GLEE-Pro
Video Object Segmentation | Refer-YouTube-VOS | F | 72.9 | GLEE-Pro
Video Object Segmentation | Refer-YouTube-VOS | J | 68.2 | GLEE-Pro
Video Object Segmentation | Refer-YouTube-VOS | J&F | 70.6 | GLEE-Pro
Video Object Segmentation | Refer-YouTube-VOS | F | 69.7 | GLEE-Plus
Video Object Segmentation | Refer-YouTube-VOS | J | 65.6 | GLEE-Plus
Video Object Segmentation | Refer-YouTube-VOS | J&F | 67.7 | GLEE-Plus
Video Object Segmentation | BURST-val | HOTA (all) | 31.2 | GLEE-Pro
Video Object Segmentation | BURST-val | HOTA (com) | 48.7 | GLEE-Pro
Video Object Segmentation | BURST-val | HOTA (unc) | 26.9 | GLEE-Pro
Video Object Segmentation | BURST-val | mAP (all) | 19.2 | GLEE-Pro
Video Object Segmentation | BURST-val | mAP (com) | 24.8 | GLEE-Pro
Video Object Segmentation | BURST-val | mAP (unc) | 17.7 | GLEE-Pro
Video Object Segmentation | BURST-val | HOTA (all) | 26.9 | GLEE-Plus
Video Object Segmentation | BURST-val | HOTA (com) | 38.8 | GLEE-Plus
Video Object Segmentation | BURST-val | HOTA (unc) | 23.9 | GLEE-Plus
Video Object Segmentation | BURST-val | mAP (all) | 17.2 | GLEE-Plus
Video Object Segmentation | BURST-val | mAP (com) | 23.7 | GLEE-Plus
Video Object Segmentation | BURST-val | mAP (unc) | 15.5 | GLEE-Plus
Video Object Segmentation | BURST-val | HOTA (all) | 22.6 | GLEE-Lite
Video Object Segmentation | BURST-val | HOTA (com) | 36.4 | GLEE-Lite
Video Object Segmentation | BURST-val | HOTA (unc) | 19.1 | GLEE-Lite
Video Object Segmentation | BURST-val | mAP (all) | 12.6 | GLEE-Lite
Video Object Segmentation | BURST-val | mAP (com) | 18.9 | GLEE-Lite
Video Object Segmentation | BURST-val | mAP (unc) | 11.0 | GLEE-Lite
Video Object Segmentation | BURST | HOTA (all) | 22.6 | GLEE-Lite
Video Object Segmentation | BURST | HOTA (com) | 36.4 | GLEE-Lite
Video Object Segmentation | BURST | HOTA (unc) | 19.1 | GLEE-Lite
Video Object Segmentation | BURST | mAP (all) | 12.6 | GLEE-Lite
Video Object Segmentation | BURST | mAP (com) | 18.9 | GLEE-Lite
Video Object Segmentation | BURST | mAP (unc) | 11.0 | GLEE-Lite
Referring Expression Segmentation | RefCOCO | IoU | 80.0 | GLEE-Pro
Referring Expression Segmentation | RefCoCo val | Overall IoU | 80.0 | GLEE-Pro
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 72.9 | GLEE-Pro
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 68.2 | GLEE-Pro
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 70.6 | GLEE-Pro
Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 69.6 | GLEE-Pro
Referring Expression Segmentation | RefCOCOg-val | Overall IoU | 72.9 | GLEE-Pro
Video Instance Segmentation | OVIS validation | AP75 | 55.5 | GLEE-Pro
Video Instance Segmentation | OVIS validation | mask AP | 50.4 | GLEE-Pro
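As a quick sanity check on the Refer-YouTube-VOS rows above: J&F is conventionally the arithmetic mean of region similarity (J) and contour accuracy (F). A few lines of Python, with the J, F, and J&F values copied from the table, confirm the reported averages to within rounding:

```python
# (model, J, F, reported J&F) taken from the Refer-YouTube-VOS rows above
rows = [
    ("GLEE-Pro", 68.2, 72.9, 70.6),
    ("GLEE-Plus", 65.6, 69.7, 67.7),
]

for model, j, f, reported in rows:
    jf = (j + f) / 2
    # reported values are rounded to one decimal place, so allow 0.051 slack
    assert abs(jf - reported) <= 0.051, (model, jf, reported)
```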

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)