
Revisiting the Role of Language Priors in Vision-Language Models

Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan

2023-06-02 · Question Answering · Image-text Retrieval · Image-text matching · Text Matching · Text Retrieval · Visual Reasoning · Retrieval · Visual Question Answering (VQA) · Language Modelling · Visual Question Answering
Paper · PDF · Code (official)

Abstract

Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores any image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual-question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls for the amount of linguistic bias in generative VLMs at test time without having to retrain or fine-tune the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, oftentimes producing state-of-the-art accuracy.
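The core idea is compact enough to sketch in a few lines. The snippet below is a minimal illustration, not the authors' released implementation: it assumes you already have per-token log-probabilities from a generative VLM conditioned on the image, plus a "blind" pass without the image, and applies a debiasing of the form log P(text|image) - α · log P(text). The α value and caption scores shown are hypothetical; in the paper α is tuned at test time (the "α-tuned" rows in the results below).

```python
# Minimal sketch of a VisualGPTScore-style match score and test-time debiasing.
# Assumes per-token log-probabilities are already available from a generative VLM;
# obtaining them from a specific model is out of scope here.

def visual_gpt_score(token_logprobs):
    """Match score for image-text retrieval: log P(text | image),
    i.e. the summed log-probabilities of the caption tokens given the image."""
    return sum(token_logprobs)

def debiased_score(image_conditioned_logprob, blind_logprob, alpha=1.0):
    """Test-time debiasing in the spirit of the paper: subtract alpha times the
    'blind' language-prior score log P(text). alpha = 0 recovers the raw
    VisualGPTScore; larger alpha discounts more of the linguistic bias."""
    return image_conditioned_logprob - alpha * blind_logprob

# Hypothetical numbers for two candidate captions of one image.
cond = {"a dog chasing a ball": -12.3, "a ball chasing a dog": -14.1}   # log P(t | i)
blind = {"a dog chasing a ball": -9.0, "a ball chasing a dog": -15.5}   # log P(t), no image

for caption in cond:
    print(caption, round(debiased_score(cond[caption], blind[caption], alpha=0.5), 2))
```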

Results

Task | Dataset | Metric | Value | Model
Visual Reasoning | Winoground | Group Score | 16.8 | BLIP (VisualGPTScore, α-tuned)
Visual Reasoning | Winoground | Image Score | 21.5 | BLIP (VisualGPTScore, α-tuned)
Visual Reasoning | Winoground | Text Score | 36.5 | BLIP (VisualGPTScore, α-tuned)
Visual Reasoning | Winoground | Group Score | 13.3 | BLIP (ITM)
Visual Reasoning | Winoground | Image Score | 15.8 | BLIP (ITM)
Visual Reasoning | Winoground | Text Score | 35.8 | BLIP (ITM)
Visual Reasoning | Winoground | Group Score | 6.5 | BLIP (ITC)
Visual Reasoning | Winoground | Image Score | 9 | BLIP (ITC)
Visual Reasoning | Winoground | Text Score | 28 | BLIP (ITC)
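For context, the Text, Image, and Group scores above follow the Winoground benchmark's definitions. The sketch below shows how they would be computed from any pairwise match score s(caption, image), such as a (debiased) VisualGPTScore; the function and variable names are illustrative.

```python
# Winoground metrics: each example has two images (i0, i1) and two captions (c0, c1),
# and s(c, i) is the model's image-text match score.

def text_score(s, c0, c1, i0, i1):
    # With each image fixed, its own caption must score higher than the other caption.
    return s(c0, i0) > s(c1, i0) and s(c1, i1) > s(c0, i1)

def image_score(s, c0, c1, i0, i1):
    # With each caption fixed, its own image must score higher than the other image.
    return s(c0, i0) > s(c0, i1) and s(c1, i1) > s(c1, i0)

def group_score(s, c0, c1, i0, i1):
    # Strictest metric: the example counts only if both conditions hold.
    return text_score(s, c0, c1, i0, i1) and image_score(s, c0, c1, i0, i1)
```

The reported values are the percentage of benchmark examples for which each predicate holds.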

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)