Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
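The first method described above scores alignment by generating question-answer pairs from the text and checking whether a VQA model confirms them against the image. The sketch below shows only the aggregation step of such a pipeline, assuming hypothetical `generate_qa` and `vqa_answer_prob` callables standing in for the actual question-generation and VQA models; it is an illustration of the idea, not the paper's implementation.

```python
from typing import Callable, List, Tuple


def alignment_score(
    text: str,
    image_path: str,
    generate_qa: Callable[[str], List[Tuple[str, str]]],
    vqa_answer_prob: Callable[[str, str, str], float],
) -> float:
    """Score text-image alignment as the average, over QA pairs generated
    from the text, of the VQA model's probability of the expected answer
    given the image. Both callables are placeholders for learned models."""
    qa_pairs = generate_qa(text)
    if not qa_pairs:
        return 0.0
    probs = [vqa_answer_prob(image_path, question, answer)
             for question, answer in qa_pairs]
    return sum(probs) / len(probs)
```

With stub models, a pair whose questions are mostly answered correctly from the image receives a higher score than one where the VQA model disagrees, which is what allows the per-question scores to localize specific misalignments.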
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Group Score | 30.5 | VQ2 |
| Visual Reasoning | Winoground | Image Score | 42.2 | VQ2 |
| Visual Reasoning | Winoground | Text Score | 47 | VQ2 |
| Visual Reasoning | Winoground | Group Score | 28.75 | PaLI (ft SNLI-VE + Synthetic Data) |
| Visual Reasoning | Winoground | Image Score | 38 | PaLI (ft SNLI-VE + Synthetic Data) |
| Visual Reasoning | Winoground | Text Score | 46.5 | PaLI (ft SNLI-VE + Synthetic Data) |
| Visual Reasoning | Winoground | Group Score | 28.7 | PaLI (ft SNLI-VE) |
| Visual Reasoning | Winoground | Image Score | 41.5 | PaLI (ft SNLI-VE) |
| Visual Reasoning | Winoground | Text Score | 45 | PaLI (ft SNLI-VE) |
| Visual Reasoning | Winoground | Group Score | 23.5 | BLIP2 (ft COCO) |
| Visual Reasoning | Winoground | Image Score | 26 | BLIP2 (ft COCO) |
| Visual Reasoning | Winoground | Text Score | 44 | BLIP2 (ft COCO) |
| Visual Reasoning | Winoground | Group Score | 8.25 | CoCa ViT-L14 (ft COCO) |
| Visual Reasoning | Winoground | Image Score | 11.5 | CoCa ViT-L14 (ft COCO) |
| Visual Reasoning | Winoground | Text Score | 28.25 | CoCa ViT-L14 (ft COCO) |
| Visual Reasoning | Winoground | Group Score | 9 | OFA large (ft SNLI-VE) |
| Visual Reasoning | Winoground | Image Score | 14.3 | OFA large (ft SNLI-VE) |
| Visual Reasoning | Winoground | Text Score | 27.7 | OFA large (ft SNLI-VE) |
| Visual Reasoning | Winoground | Group Score | 10.25 | CLIP RN50x64 |
| Visual Reasoning | Winoground | Image Score | 13.75 | CLIP RN50x64 |
| Visual Reasoning | Winoground | Text Score | 26.5 | CLIP RN50x64 |
| Visual Reasoning | Winoground | Group Score | 11.3 | TIFA |
| Visual Reasoning | Winoground | Image Score | 12.5 | TIFA |
| Visual Reasoning | Winoground | Text Score | 19 | TIFA |
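The three Winoground metrics in the table are standard: each example pairs two captions (c0, c1) with two images (i0, i1), and a model earns the text score when each image selects its own caption, the image score when each caption selects its own image, and the group score when both hold. A minimal sketch, assuming each example is a dict of the four pairwise alignment scores under illustrative keys `c0_i0`, `c0_i1`, `c1_i0`, `c1_i1`:

```python
from typing import Dict, List


def winoground_scores(examples: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute Winoground text/image/group scores (as percentages) from
    pairwise caption-image alignment scores. All comparisons are strict,
    so ties count as failures."""
    n = len(examples)
    text_correct = image_correct = group_correct = 0
    for e in examples:
        # Text score: each image ranks its matching caption higher.
        t = e["c0_i0"] > e["c1_i0"] and e["c1_i1"] > e["c0_i1"]
        # Image score: each caption ranks its matching image higher.
        i = e["c0_i0"] > e["c0_i1"] and e["c1_i1"] > e["c1_i0"]
        text_correct += t
        image_correct += i
        group_correct += t and i
    return {
        "text": 100.0 * text_correct / n,
        "image": 100.0 * image_correct / n,
        "group": 100.0 * group_correct / n,
    }
```

Because the group score requires both conditions on every example, it is always bounded above by the text and image scores, which matches the ordering of the rows for each model in the table.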