Measuring Progress in Fine-grained Vision-and-Language Understanding

Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh

2023-05-12 | Visual Reasoning

Abstract

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.

Results

Task	Dataset	Metric	Value	Model
Visual Reasoning	Winoground	Group Score	21.2	X-VLM 16M
Visual Reasoning	Winoground	Image Score	24.5	X-VLM 16M
Visual Reasoning	Winoground	Text Score	46.7	X-VLM 16M
Visual Reasoning	Winoground	Group Score	21.5	X-VLM 4M
Visual Reasoning	Winoground	Image Score	26.7	X-VLM 4M
Visual Reasoning	Winoground	Text Score	44.0	X-VLM 4M
Visual Reasoning	Winoground	Group Score	14.5	BLIP 14M
Visual Reasoning	Winoground	Image Score	18.5	BLIP 14M
Visual Reasoning	Winoground	Text Score	36.5	BLIP 14M
Visual Reasoning	Winoground	Group Score	11.7	BLIP 129M
Visual Reasoning	Winoground	Image Score	15.0	BLIP 129M
Visual Reasoning	Winoground	Text Score	35.5	BLIP 129M
Visual Reasoning	Winoground	Group Score	12.2	BLIP 129M (CapFilt/L)
Visual Reasoning	Winoground	Image Score	15.2	BLIP 129M (CapFilt/L)
Visual Reasoning	Winoground	Text Score	34.7	BLIP 129M (CapFilt/L)
Visual Reasoning	Winoground	Group Score	12.2	BLIP-ViT/L 129M
Visual Reasoning	Winoground	Image Score	14.5	BLIP-ViT/L 129M
Visual Reasoning	Winoground	Text Score	34.7	BLIP-ViT/L 129M
Visual Reasoning	Winoground	Group Score	12.2	PEVL 14M
Visual Reasoning	Winoground	Image Score	15.7	PEVL 14M
Visual Reasoning	Winoground	Text Score	33.2	PEVL 14M
Visual Reasoning	Winoground	Group Score	12.7	ALBEF 14M
Visual Reasoning	Winoground	Image Score	16.2	ALBEF 14M
Visual Reasoning	Winoground	Text Score	32.5	ALBEF 14M
Visual Reasoning	Winoground	Group Score	11.0	ALBEF 4M
Visual Reasoning	Winoground	Image Score	15.5	ALBEF 4M
Visual Reasoning	Winoground	Text Score	29.2	ALBEF 4M
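
The table reports three Winoground metrics without defining them. As a rough guide, the sketch below computes them the way the Winoground benchmark (Thrush et al., 2022) describes: each example pairs two captions (c0, c1) with two images (i0, i1), and a model assigns a match score to every caption-image combination. The dictionary keys, function names, and toy scores here are illustrative assumptions, not part of this paper or of any released evaluation code.

```python
from typing import Dict, List

# One example = the four caption-image match scores from some model.
# Key "c0_i0" means "score of caption 0 with image 0" (hypothetical naming).
Example = Dict[str, float]


def text_correct(s: Example) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Example) -> bool:
    # For each caption, the matching image must outscore the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Example) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Example]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    return {
        "text": 100.0 * sum(text_correct(e) for e in examples) / n,
        "image": 100.0 * sum(image_correct(e) for e in examples) / n,
        "group": 100.0 * sum(group_correct(e) for e in examples) / n,
    }


# Toy scores, invented purely to exercise the functions.
toy_examples = [
    {"c0_i0": 0.9, "c0_i1": 0.1, "c1_i0": 0.2, "c1_i1": 0.8},  # all correct
    {"c0_i0": 0.6, "c0_i1": 0.7, "c1_i0": 0.5, "c1_i1": 0.8},  # text only
    {"c0_i0": 0.6, "c0_i1": 0.3, "c1_i0": 0.7, "c1_i1": 0.8},  # image only
    {"c0_i0": 0.5, "c0_i1": 0.5, "c1_i0": 0.5, "c1_i1": 0.5},  # ties fail
]
print(winoground_scores(toy_examples))
```

Because the group check requires both the text and the image check to pass, the group score can never exceed either of the other two, which is consistent with every row of the table above.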
