Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


GLIPv2: Unifying Localization and Vision-Language Understanding

Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

Published: 2022-06-12
Tasks: Masked Language Modeling, Referring Expression Segmentation, Semantic Segmentation, Image Captioning, Phrase Grounding, Contrastive Learning, Open Vocabulary Object Detection, Instance Segmentation, 2D Object Detection, Visual Question Answering (VQA), Object Detection, Language Modelling
Links: Paper | PDF | Code (official)

Abstract

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
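To make the region-word contrastive learning task concrete, here is a minimal NumPy sketch of the general idea: region features and word features are compared via a scaled similarity matrix, and each region is trained to pick out its ground-truth word. The function name, signature, temperature value, and loss direction (region-to-word only) are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def region_word_contrastive_loss(regions, words, alignment, tau=0.07):
    """Illustrative region-to-word contrastive loss (sketch, not GLIPv2's code).

    regions:   (R, d) array of region feature vectors
    words:     (W, d) array of word/token feature vectors
    alignment: (R,) index of the ground-truth word for each region
    tau:       temperature scaling the similarity logits (assumed value)
    """
    # L2-normalize so dot products become cosine similarities
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    logits = (r @ w.T) / tau                      # (R, W) similarity matrix
    # softmax over words for each region, with max-subtraction for stability
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # negative log-likelihood of the aligned word for each region
    return -np.mean(np.log(probs[np.arange(len(r)), alignment] + 1e-12))
```

In a full VLP setup this loss would typically be computed symmetrically (word-to-region as well) and batched across images; the sketch keeps only the single-direction core to show how region and word embeddings are pulled together.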

Results

Task | Dataset | Metric | Value | Model
Phrase Grounding | Flickr30k Entities Test | R@1 | 87.7 | GLIPv2
Object Detection | LVIS v1.0 minival | box AP | 59.8 | GLIPv2
Object Detection | COCO test-dev | box mAP | 62.4 | GLIPv2 (CoSwin-H, multi-scale)
Object Detection | ODinW Full-Shot 13 Tasks | AP | 70.4 | GLIPv2
Instance Segmentation | PhraseCut | Mean IoU | 61.3 | GLIPv2
Referring Expression Segmentation | PhraseCut | Mean IoU | 61.3 | GLIPv2

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)