Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SelfEval: Leveraging the discriminative nature of generative models for evaluation

Sai Saketh Rambhatla, Ishan Misra

2023-11-17 · Attribute · Visual Reasoning

Abstract

We present an automated way to evaluate the text alignment of text-to-image generative diffusion models using standard image-text recognition datasets. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, and the likelihood can be used to perform recognition tasks with the generative model. We evaluate generative models on standard datasets created for multimodal text-image discriminative learning and assess fine-grained aspects of their performance: attribute binding, color recognition, counting, shape recognition, spatial understanding. Existing automated metrics rely on an external pretrained model like CLIP (VLMs) or LLMs, and are sensitive to the exact pretrained model and its limitations. SelfEval sidesteps these issues, and to the best of our knowledge, is the first automated metric to show a high degree of agreement for measuring text-faithfulness with the gold-standard human evaluations across multiple generative models, benchmarks and evaluation metrics. SelfEval also reveals that generative models showcase competitive recognition performance on challenging tasks such as Winoground image-score compared to discriminative models. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.
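The abstract's central idea, that per-prompt likelihoods from a generative model can drive a recognition task, can be illustrated with a minimal sketch. It assumes the log-likelihoods log p(image | caption) have already been estimated; the function name `classify_by_likelihood`, the uniform prior over captions, and the scores below are hypothetical illustrations, not the authors' implementation.

```python
import math


def classify_by_likelihood(log_likelihoods):
    """Pick the caption with the highest log p(image | caption).

    log_likelihoods: dict mapping caption -> log-likelihood. These are
    hypothetical inputs; a real pipeline would estimate them with the
    generative model. A uniform prior over captions is assumed, so the
    posterior is a softmax over the log-likelihoods.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_likelihoods.values())
    exp_scores = {c: math.exp(v - m) for c, v in log_likelihoods.items()}
    total = sum(exp_scores.values())
    posterior = {c: s / total for c, s in exp_scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Made-up scores for one image and three candidate captions.
scores = {"a red cube": -10.0, "a blue cube": -12.0, "a red sphere": -11.0}
best, posterior = classify_by_likelihood(scores)
print(best, round(posterior[best], 3))
```

Because only the ranking of captions matters for recognition, the softmax step is optional; it is included here to show how the scores could be read as a posterior under the stated uniform-prior assumption.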

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Image Score | 12.75 | OCLIP (ViT-H/14) |
| Visual Reasoning | Winoground | Text Score | 30.75 | OCLIP (ViT-H/14) |
| Visual Reasoning | Winoground | Image Score | 8 | CLIP (ViT-L/14) |
| Visual Reasoning | Winoground | Text Score | 30.25 | CLIP (ViT-L/14) |
| Visual Reasoning | Winoground | Image Score | 13.5 | LDM-T5 (SelfEval) |
| Visual Reasoning | Winoground | Text Score | 29 | LDM-T5 (SelfEval) |
| Visual Reasoning | Winoground | Image Score | 12 | PDM-T5 (SelfEval) |
| Visual Reasoning | Winoground | Text Score | 28.25 | PDM-T5 (SelfEval) |
| Visual Reasoning | Winoground | Image Score | 7.25 | LDM-CLIP (SelfEval) |
| Visual Reasoning | Winoground | Text Score | 22.75 | LDM-CLIP (SelfEval) |
| Visual Reasoning | Winoground | Image Score | 14 | PDM-CLIP (SelfEval) |
| Visual Reasoning | Winoground | Text Score | 17 | PDM-CLIP (SelfEval) |
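For readers unfamiliar with the two metrics reported above, the following sketch shows how Winoground's text score and image score are computed, as the benchmark is commonly defined: each example pairs two captions with two images, and a model is credited only if it gets both comparisons right. The score matrices are made-up numbers, and `winoground_scores` is a hypothetical helper; under SelfEval the entries would presumably be likelihood-based scores, which this page does not detail.

```python
def winoground_scores(examples):
    """Compute (text_score, image_score) as percentages.

    examples: list of 2x2 matrices s, where s[c][i] is the model's
    matching score for caption c and image i. Caption 0 belongs with
    image 0 and caption 1 with image 1.
    """
    text_hits = 0
    image_hits = 0
    for s in examples:
        # Text score: for each image, the correct caption must score higher.
        if s[0][0] > s[1][0] and s[1][1] > s[0][1]:
            text_hits += 1
        # Image score: for each caption, the correct image must score higher.
        if s[0][0] > s[0][1] and s[1][1] > s[1][0]:
            image_hits += 1
    n = len(examples)
    return 100.0 * text_hits / n, 100.0 * image_hits / n


# Four made-up examples, for illustration only.
examples = [
    [[0.9, 0.1], [0.2, 0.8]],    # both comparisons correct
    [[0.9, 0.95], [0.2, 0.97]],  # text correct, image wrong
    [[0.8, 0.85], [0.1, 0.9]],   # text correct, image wrong
    [[0.1, 0.9], [0.9, 0.1]],    # both wrong
]
text_score, image_score = winoground_scores(examples)
print(text_score, image_score)  # 75.0 25.0
```

The image score compares one caption across two images, so it requires scores that are comparable across different images. This is one reason it tends to be the harder of the two metrics.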

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Non-Adaptive Adversarial Face Generation (2025-07-16)
Attributes Shape the Embedding Space of Face Recognition Models (2025-07-15)
COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation (2025-07-15)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
Model Parallelism With Subnetwork Data Parallelism (2025-07-11)