What You See is What You Read? Improving Text-Image Alignment Evaluation

Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor

Date: 2023-05-17
Venue: NeurIPS 2023 11
Tasks: Question Answering, Text-to-Image Generation, Text Generation, Image to text, Text to Image Generation, Visual Reasoning, Question Generation, Image Generation, Visual Question Answering

Abstract

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
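
The question-generation and visual-question-answering pipeline, and its use for re-ranking and for localizing misalignments, might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_qa` and `vqa_score` are hypothetical callables standing in for the question generation and VQA models, and averaging the per-question scores is an assumption, since the abstract does not say how answers are aggregated.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   generate_qa(text) -> list of (question, expected_answer) pairs
#   vqa_score(image, question, answer) -> support for the answer, in [0, 1]
QAGenerator = Callable[[str], List[Tuple[str, str]]]
VQAScorer = Callable[[Any, str, str], float]


def qa_scores(image: Any, text: str,
              generate_qa: QAGenerator,
              vqa_score: VQAScorer) -> List[Tuple[float, str, str]]:
    """Score every generated question-answer pair against the image."""
    return [(vqa_score(image, q, a), q, a) for q, a in generate_qa(text)]


def alignment_score(image: Any, text: str,
                    generate_qa: QAGenerator,
                    vqa_score: VQAScorer) -> float:
    """Aggregate per-question scores by a plain mean (an assumed choice)."""
    scored = qa_scores(image, text, generate_qa, vqa_score)
    if not scored:
        return 0.0
    return sum(s for s, _, _ in scored) / len(scored)


def least_supported(image: Any, text: str,
                    generate_qa: QAGenerator,
                    vqa_score: VQAScorer) -> Tuple[float, str, str]:
    """Return the lowest-scoring pair as a candidate misalignment."""
    return min(qa_scores(image, text, generate_qa, vqa_score),
               key=lambda item: item[0])


def rerank(images: List[Any], text: str,
           generate_qa: QAGenerator,
           vqa_score: VQAScorer) -> List[Tuple[float, Any]]:
    """Sort candidate images by alignment score, best first."""
    scored = [(alignment_score(img, text, generate_qa, vqa_score), img)
              for img in images]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Under these assumptions, `rerank` would be applied to several images generated for one prompt, and `least_supported` points at the part of the text the image supports least.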

Results

Task              Dataset     Metric       Value  Model
Visual Reasoning  Winoground  Group Score  30.5   VQ2
Visual Reasoning  Winoground  Image Score  42.2   VQ2
Visual Reasoning  Winoground  Text Score   47     VQ2
Visual Reasoning  Winoground  Group Score  28.75  PaLI (ft SNLI-VE + Synthetic Data)
Visual Reasoning  Winoground  Image Score  38     PaLI (ft SNLI-VE + Synthetic Data)
Visual Reasoning  Winoground  Text Score   46.5   PaLI (ft SNLI-VE + Synthetic Data)
Visual Reasoning  Winoground  Group Score  28.7   PaLI (ft SNLI-VE)
Visual Reasoning  Winoground  Image Score  41.5   PaLI (ft SNLI-VE)
Visual Reasoning  Winoground  Text Score   45     PaLI (ft SNLI-VE)
Visual Reasoning  Winoground  Group Score  23.5   BLIP2 (ft COCO)
Visual Reasoning  Winoground  Image Score  26     BLIP2 (ft COCO)
Visual Reasoning  Winoground  Text Score   44     BLIP2 (ft COCO)
Visual Reasoning  Winoground  Group Score  8.25   COCA ViT-L14 (f.t on COCO)
Visual Reasoning  Winoground  Image Score  11.5   COCA ViT-L14 (f.t on COCO)
Visual Reasoning  Winoground  Text Score   28.25  COCA ViT-L14 (f.t on COCO)
Visual Reasoning  Winoground  Group Score  9      OFA large (ft SNLI-VE)
Visual Reasoning  Winoground  Image Score  14.3   OFA large (ft SNLI-VE)
Visual Reasoning  Winoground  Text Score   27.7   OFA large (ft SNLI-VE)
Visual Reasoning  Winoground  Group Score  10.25  CLIP RN50x64
Visual Reasoning  Winoground  Image Score  13.75  CLIP RN50x64
Visual Reasoning  Winoground  Text Score   26.5   CLIP RN50x64
Visual Reasoning  Winoground  Group Score  11.3   TIFA
Visual Reasoning  Winoground  Image Score  12.5   TIFA
Visual Reasoning  Winoground  Text Score   19     TIFA
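
The Text, Image, and Group Scores above are not defined on this page. The sketch below assumes the standard Winoground definitions: each example has two captions (c0, c1) and two images (i0, i1), and a model assigns a score to every caption-image pair. The `Example` type and its field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """Model scores for the four caption-image pairings of one example.

    c0 correctly matches i0, and c1 correctly matches i1.
    """
    c0_i0: float
    c0_i1: float
    c1_i0: float
    c1_i1: float


def text_correct(ex: Example) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return ex.c0_i0 > ex.c1_i0 and ex.c1_i1 > ex.c0_i1


def image_correct(ex: Example) -> bool:
    # For each caption, the matching image must outscore the other image.
    return ex.c0_i0 > ex.c0_i1 and ex.c1_i1 > ex.c1_i0


def winoground_scores(examples: List[Example]) -> Dict[str, float]:
    """Return Text, Image, and Group Scores as percentages."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    text = sum(text_correct(ex) for ex in examples)
    image = sum(image_correct(ex) for ex in examples)
    # An example counts toward the Group Score only if both checks pass.
    group = sum(text_correct(ex) and image_correct(ex) for ex in examples)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }
```

Because the Group Score requires both conditions to hold on the same example, it can never exceed either the Text Score or the Image Score, which is consistent with every row of the table.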