Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
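The first method described above scores alignment by generating question-answer pairs from the text and checking whether a VQA model confirms them against the image. The sketch below shows only the aggregation step of such a pipeline, assuming hypothetical `generate_qa` and `vqa_answer_prob` callables standing in for the actual question-generation and VQA models; it is an illustration of the idea, not the paper's implementation.

```python
from typing import Callable, List, Tuple


def alignment_score(
    text: str,
    image_path: str,
    generate_qa: Callable[[str], List[Tuple[str, str]]],
    vqa_answer_prob: Callable[[str, str, str], float],
) -> float:
    """Score text-image alignment as the average, over QA pairs generated
    from the text, of the VQA model's probability of the expected answer
    given the image. Both callables are placeholders for learned models."""
    qa_pairs = generate_qa(text)
    if not qa_pairs:
        return 0.0
    probs = [vqa_answer_prob(image_path, question, answer)
             for question, answer in qa_pairs]
    return sum(probs) / len(probs)
```

With stub models, a pair whose questions are mostly answered correctly from the image receives a higher score than one where the VQA model disagrees, which is what allows the per-question scores to localize specific misalignments.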
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Group Score | 30.5 | VQ2 |
| Visual Reasoning | Winoground | Image Score | 42.2 | VQ2 |
| Visual Reasoning | Winoground | Text Score | 47 | VQ2 |
| Visual Reasoning | Winoground | Group Score | 28.75 | PaLI (ft SNLI-VE + Synthetic Data) |
| Visual Reasoning | Winoground | Image Score | 38 | PaLI (ft SNLI-VE + Synthetic Data) |
| Visual Reasoning | Winoground | Text Score | 46.5 | PaLI (ft SNLI-VE + Synthetic Data) |
| Visual Reasoning | Winoground | Group Score | 28.7 | PaLI (ft SNLI-VE) |
| Visual Reasoning | Winoground | Image Score | 41.5 | PaLI (ft SNLI-VE) |
| Visual Reasoning | Winoground | Text Score | 45 | PaLI (ft SNLI-VE) |
| Visual Reasoning | Winoground | Group Score | 23.5 | BLIP2 (ft COCO) |
| Visual Reasoning | Winoground | Image Score | 26 | BLIP2 (ft COCO) |
| Visual Reasoning | Winoground | Text Score | 44 | BLIP2 (ft COCO) |
| Visual Reasoning | Winoground | Group Score | 8.25 | CoCa ViT-L14 (ft COCO) |
| Visual Reasoning | Winoground | Image Score | 11.5 | CoCa ViT-L14 (ft COCO) |
| Visual Reasoning | Winoground | Text Score | 28.25 | CoCa ViT-L14 (ft COCO) |
| Visual Reasoning | Winoground | Group Score | 9 | OFA large (ft SNLI-VE) |
| Visual Reasoning | Winoground | Image Score | 14.3 | OFA large (ft SNLI-VE) |
| Visual Reasoning | Winoground | Text Score | 27.7 | OFA large (ft SNLI-VE) |
| Visual Reasoning | Winoground | Group Score | 10.25 | CLIP RN50x64 |
| Visual Reasoning | Winoground | Image Score | 13.75 | CLIP RN50x64 |
| Visual Reasoning | Winoground | Text Score | 26.5 | CLIP RN50x64 |
| Visual Reasoning | Winoground | Group Score | 11.3 | TIFA |
| Visual Reasoning | Winoground | Image Score | 12.5 | TIFA |
| Visual Reasoning | Winoground | Text Score | 19 | TIFA |
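The three Winoground metrics in the table are standard: each example pairs two captions (c0, c1) with two images (i0, i1), and a model earns the text score when each image selects its own caption, the image score when each caption selects its own image, and the group score when both hold. A minimal sketch, assuming each example is a dict of the four pairwise alignment scores under illustrative keys `c0_i0`, `c0_i1`, `c1_i0`, `c1_i1`:

```python
from typing import Dict, List


def winoground_scores(examples: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute Winoground text/image/group scores (as percentages) from
    pairwise caption-image alignment scores. All comparisons are strict,
    so ties count as failures."""
    n = len(examples)
    text_correct = image_correct = group_correct = 0
    for e in examples:
        # Text score: each image ranks its matching caption higher.
        t = e["c0_i0"] > e["c1_i0"] and e["c1_i1"] > e["c0_i1"]
        # Image score: each caption ranks its matching image higher.
        i = e["c0_i0"] > e["c0_i1"] and e["c1_i1"] > e["c1_i0"]
        text_correct += t
        image_correct += i
        group_correct += t and i
    return {
        "text": 100.0 * text_correct / n,
        "image": 100.0 * image_correct / n,
        "group": 100.0 * group_correct / n,
    }
```

Because the group score requires both conditions on every example, it is always bounded above by the text and image scores, which matches the ordering of the rows for each model in the table.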