Yan Zeng, Xinsong Zhang, Hang Li
Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts and, at the same time, to align the texts with the visual concepts, where the alignments are at multiple granularities. Experimental results show that X-VLM effectively applies the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 78.22 | X-VLM (base) |
| Visual Reasoning | NLVR2 Dev | Accuracy | 84.41 | X-VLM (base) |
| Visual Reasoning | NLVR2 Test | Accuracy | 84.76 | X-VLM (base) |
| Image Captioning | COCO Captions | BLEU-4 | 41.3 | X-VLM (base) |
| Image Captioning | COCO Captions | CIDEr | 140.8 | X-VLM (base) |
| Image Retrieval | Flickr30K 1K test | R@1 | 86.9 | X-VLM (base) |
| Image Retrieval | Flickr30K 1K test | R@5 | 97.3 | X-VLM (base) |
| Image Retrieval | Flickr30K 1K test | R@10 | 98.7 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 97.1 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 100 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 100 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 86.9 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 97.3 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 98.7 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 81.2 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 95.6 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 98.2 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 63.4 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 85.8 | X-VLM (base) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 91.5 | X-VLM (base) |
| Object Detection | OVAD-Box benchmark | mean average precision | 28 | X-VLM |
| Visual Grounding | RefCOCO+ val | Accuracy (%) | 84.51 | X-VLM (base) |
| Visual Grounding | RefCOCO+ testA | Accuracy (%) | 89 | X-VLM (base) |
| Visual Grounding | RefCOCO+ testB | Accuracy (%) | 76.91 | X-VLM (base) |
| 2D Classification | OVAD-Box benchmark | mean average precision | 28 | X-VLM |
| 2D Object Detection | OVAD-Box benchmark | mean average precision | 28 | X-VLM |
| Open Vocabulary Object Detection | OVAD-Box benchmark | mean average precision | 28 | X-VLM |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 97.1 | X-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 100 | X-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | X-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 86.9 | X-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 97.3 | X-VLM (base) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 98.7 | X-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 81.2 | X-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 95.6 | X-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.2 | X-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 63.4 | X-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 85.8 | X-VLM (base) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 91.5 | X-VLM (base) |
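
The R@K rows in the table report recall at rank K: the percentage of queries for which at least one correct match appears among the top K retrieved candidates. The following is a minimal sketch of that computation, assuming a precomputed query-by-candidate similarity matrix. The function name `recall_at_k`, its arguments, and the toy data are illustrative assumptions and are not taken from the X-VLM code.

```python
import numpy as np


def recall_at_k(similarity, ground_truth, ks=(1, 5, 10)):
    """Compute Recall@K (in percent) for a retrieval task.

    similarity:   array of shape (num_queries, num_candidates), higher = better match.
    ground_truth: one set of correct candidate indices per query
                  (hypothetical input format chosen for this sketch).
    """
    # Rank candidates for each query from most to least similar.
    ranking = np.argsort(-np.asarray(similarity), axis=1)
    results = {}
    for k in ks:
        # A query counts as a hit if any correct candidate is in its top k.
        hits = [
            any(int(c) in gt for c in ranking[q, :k])
            for q, gt in enumerate(ground_truth)
        ]
        results[f"R@{k}"] = 100.0 * float(np.mean(hits))
    return results


# Toy example: 3 queries, 3 candidates.
toy_similarity = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
])
toy_ground_truth = [{0}, {1}, {0}]
toy_scores = recall_at_k(toy_similarity, toy_ground_truth, ks=(1, 2))
```

In the toy example only the first query ranks its correct candidate first, while all three queries have it within the top two. For text-to-image retrieval the queries are captions and the candidates are images; for image-to-text retrieval the roles are swapped, and an image may have several correct captions, which is why each query takes a set of correct indices.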