Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding, a VL reformulation of the detection task; region-word contrastive learning, a novel region-word-level contrastive learning task; and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also yields mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
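The region-word contrastive idea can be illustrated with a minimal sketch: score every (region, word) pair by a dot product of their features, then apply a cross-entropy loss that pulls each region toward its gold-aligned word. This is only an illustration under assumed shapes and a simplified loss; the paper's actual objective (e.g., its use of cross-image negatives, temperature, and the grounding head) is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def region_word_alignment(regions, words):
    # regions: (num_regions, d) visual features; words: (num_words, d) token
    # features. Returns (num_regions, num_words) dot-product alignment scores.
    return regions @ words.T

def contrastive_loss(scores, targets):
    # Cross-entropy over words for each region.
    # targets: (num_regions,) index of the gold-aligned word for each region.
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: 3 regions and 3 words in a 4-dim feature space, where region i
# matches word i. Correct targets should give a lower loss than shuffled ones.
regions = np.eye(3, 4)
words = np.eye(3, 4)
scores = region_word_alignment(regions, words)
aligned = contrastive_loss(scores, np.array([0, 1, 2]))
shuffled = contrastive_loss(scores, np.array([1, 2, 0]))
```

In the full model this symmetric region-to-word (and word-to-region) matching is what lets a detector be trained from grounded captions rather than fixed class labels.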
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Phrase Grounding | Flickr30k Entities Test | R@1 | 87.7 | GLIPv2 |
| Object Detection | LVIS v1.0 minival | box AP | 59.8 | GLIPv2 |
| Object Detection | COCO test-dev | box mAP | 62.4 | GLIPv2 (CoSwin-H, multi-scale) |
| Object Detection | ODinW Full-Shot 13 Tasks | AP | 70.4 | GLIPv2 |
| Referring Expression Segmentation | PhraseCut | Mean IoU | 61.3 | GLIPv2 |