Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, Huchuan Lu
All instance perception tasks aim to find objects specified by queries such as category names, language expressions, and target annotations, but the field has been split into multiple independent subtasks. In this work, we present a universal instance perception model of the next generation, termed UNINEXT. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts. This unified formulation brings two benefits: (1) large amounts of data from different tasks and label vocabularies can be exploited to jointly train general instance-level representations, which is especially beneficial for tasks that lack training data; (2) the unified model is parameter-efficient and avoids redundant computation when handling multiple tasks simultaneously. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks, including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks. Code is available at https://github.com/MasterBin-IIAU/UNINEXT.
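
The discovery-and-retrieval idea can be illustrated with a toy sketch. This is not the UNINEXT implementation: the function names, the hand-made 2-D embeddings, the cosine-similarity matching, and the 0.5 threshold are all assumptions chosen for illustration. The sketch only shows how one fixed set of discovered instance proposals could be filtered differently depending on which prompt embedding is supplied.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_instances(
    instance_embeddings: Sequence[Sequence[float]],
    prompt_embedding: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[int]:
    """Return indices of proposals matching the prompt, best match first."""
    scored = [
        (cosine_similarity(emb, prompt_embedding), idx)
        for idx, emb in enumerate(instance_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for score, idx in scored if score >= threshold]


# Hypothetical embeddings for three discovered instance proposals.
PROPOSALS = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

# Hypothetical prompt embeddings; in a real system an encoder would produce
# these from a category name, a language expression, or a target annotation.
PROMPTS: Dict[str, List[float]] = {
    "category name": [1.0, 0.0],
    "language expression": [0.0, 1.0],
}


def run_all_prompts() -> Dict[str, List[int]]:
    """Apply every prompt to the same proposals without changing the model."""
    return {
        name: retrieve_instances(PROPOSALS, emb) for name, emb in PROMPTS.items()
    }
```

Calling `run_all_prompts()` reuses the same proposals for both prompts; only the prompt embedding changes, which mirrors the claim that different object types can be perceived by changing the input prompt.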
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Tracking | TNL2K | AUC | 59.3 | UNINEXT-H |
| Visual Tracking | TNL2K | Precision | 62.8 | UNINEXT-H |
| Object Tracking | LaSOT | AUC | 72.4 | UNINEXT-L |
| Object Tracking | LaSOT | Normalized Precision | 80.7 | UNINEXT-L |
| Object Tracking | LaSOT | Precision | 78.9 | UNINEXT-L |
| Object Tracking | LaSOT | AUC | 72.2 | UNINEXT-H |
| Object Tracking | LaSOT | Normalized Precision | 80.8 | UNINEXT-H |
| Object Tracking | LaSOT | Precision | 79.4 | UNINEXT-H |
| Object Tracking | LaSOT-ext | AUC | 56.2 | UNINEXT-H |
| Object Tracking | LaSOT-ext | Normalized Precision | 63.8 | UNINEXT-H |
| Object Tracking | LaSOT-ext | Precision | 63.8 | UNINEXT-H |
| Object Tracking | TrackingNet | Accuracy | 85.4 | UNINEXT-H |
| Object Tracking | TrackingNet | Normalized Precision | 89 | UNINEXT-H |
| Object Tracking | TrackingNet | Precision | 86.4 | UNINEXT-H |
| Object Tracking | BDD100K val | mIDF1 | 56.7 | UNINEXT-H |
| Object Tracking | BDD100K val | mMOTA | 44.2 | UNINEXT-H |
| Object Detection | COCO minival | AP50 | 77.5 | UNINEXT-H |
| Object Detection | COCO minival | AP75 | 66.7 | UNINEXT-H |
| Object Detection | COCO minival | APL | 75.3 | UNINEXT-H |
| Object Detection | COCO minival | APM | 64.8 | UNINEXT-H |
| Object Detection | COCO minival | APS | 45.1 | UNINEXT-H |
| Object Detection | COCO minival | box AP | 60.6 | UNINEXT-H |
| Object Detection | Description Detection Dataset | Intra-scenario ABS mAP | 15.9 | UNINEXT-large |
| Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 17.9 | UNINEXT-large |
| Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 18.6 | UNINEXT-large |
| Instance Segmentation | COCO test-dev | AP50 | 76.2 | UNINEXT-H |
| Instance Segmentation | COCO test-dev | AP75 | 56.7 | UNINEXT-H |
| Instance Segmentation | COCO test-dev | APL | 67.5 | UNINEXT-H |
| Instance Segmentation | COCO test-dev | APM | 55.9 | UNINEXT-H |
| Instance Segmentation | COCO test-dev | APS | 33.3 | UNINEXT-H |
| Instance Segmentation | COCO test-dev | mask AP | 51.8 | UNINEXT-H |
| Instance Segmentation | RefCOCO val | Overall IoU | 82.19 | UNINEXT-H |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 72.7 | UNINEXT-H |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 67.6 | UNINEXT-H |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 70.1 | UNINEXT-H |
| Instance Segmentation | RefCOCO+ val | Overall IoU | 72.47 | UNINEXT-H |
| Instance Segmentation | RefCOCO+ testB | Overall IoU | 66.22 | UNINEXT-H |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 72.5 | UNINEXT-H |
| Instance Segmentation | RefCOCO+ testA | Overall IoU | 76.42 | UNINEXT-H |
| Zero Shot Segmentation | Segmentation in the Wild | Mean AP | 42.1 | UNINEXT |
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 82.19 | UNINEXT-H |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 72.7 | UNINEXT-H |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 67.6 | UNINEXT-H |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 70.1 | UNINEXT-H |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 72.47 | UNINEXT-H |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 66.22 | UNINEXT-H |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 72.5 | UNINEXT-H |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 76.42 | UNINEXT-H |
| Video Instance Segmentation | OVIS validation | AP50 | 72.5 | UNINEXT (ViT-H, Online) |
| Video Instance Segmentation | OVIS validation | AP75 | 52.2 | UNINEXT (ViT-H, Online) |
| Video Instance Segmentation | OVIS validation | mask AP | 49 | UNINEXT (ViT-H, Online) |
| Video Instance Segmentation | OVIS validation | AP50 | 55.5 | UNINEXT (ResNet-50, Online) |
| Video Instance Segmentation | OVIS validation | AP75 | 35.6 | UNINEXT (ResNet-50, Online) |
| Video Instance Segmentation | OVIS validation | mask AP | 34 | UNINEXT (ResNet-50, Online) |
| Multi-Object Tracking and Segmentation | BDD100K val | mMOTSA | 35.7 | UNINEXT-H |
| Multiple Object Tracking | BDD100K val | mIDF1 | 56.7 | UNINEXT-H |
| Multiple Object Tracking | BDD100K val | mMOTA | 44.2 | UNINEXT-H |
| 2D Object Detection | COCO minival | AP50 | 77.5 | UNINEXT-H |
| 2D Object Detection | COCO minival | AP75 | 66.7 | UNINEXT-H |
| 2D Object Detection | COCO minival | APL | 75.3 | UNINEXT-H |
| 2D Object Detection | COCO minival | APM | 64.8 | UNINEXT-H |
| 2D Object Detection | COCO minival | APS | 45.1 | UNINEXT-H |
| 2D Object Detection | COCO minival | box AP | 60.6 | UNINEXT-H |
| 2D Object Detection | Description Detection Dataset | Intra-scenario ABS mAP | 15.9 | UNINEXT-large |
| 2D Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 17.9 | UNINEXT-large |
| 2D Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 18.6 | UNINEXT-large |
| Generalized Referring Expression Comprehension | gRefCOCO | N-acc. | 50.6 | UNINEXT |
| Generalized Referring Expression Comprehension | gRefCOCO | Precision@(F1=1, IoU≥0.5) | 58.2 | UNINEXT |
| Visual Object Tracking | LaSOT | AUC | 72.4 | UNINEXT-L |
| Visual Object Tracking | LaSOT | Normalized Precision | 80.7 | UNINEXT-L |
| Visual Object Tracking | LaSOT | Precision | 78.9 | UNINEXT-L |
| Visual Object Tracking | LaSOT | AUC | 72.2 | UNINEXT-H |
| Visual Object Tracking | LaSOT | Normalized Precision | 80.8 | UNINEXT-H |
| Visual Object Tracking | LaSOT | Precision | 79.4 | UNINEXT-H |
| Visual Object Tracking | LaSOT-ext | AUC | 56.2 | UNINEXT-H |
| Visual Object Tracking | LaSOT-ext | Normalized Precision | 63.8 | UNINEXT-H |
| Visual Object Tracking | LaSOT-ext | Precision | 63.8 | UNINEXT-H |
| Visual Object Tracking | TrackingNet | Accuracy | 85.4 | UNINEXT-H |
| Visual Object Tracking | TrackingNet | Normalized Precision | 89 | UNINEXT-H |
| Visual Object Tracking | TrackingNet | Precision | 86.4 | UNINEXT-H |