Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Simple Token-Level Confidence Improves Caption Correctness

Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

2023-05-11 · Hallucination · Image Captioning · Visual Reasoning · Language Modelling

Paper · PDF

Abstract

The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a relative improvement in accuracy by 10% on verb understanding in SVO-Probes and outperforms prior state-of-the-art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator provides further improved performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state-of-the-art.
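The abstract describes TLC-A as aggregating algebraic token confidences from a captioning model into a caption-level consistency score. A minimal sketch of that idea, assuming per-token log-probabilities are already available from the model; the function name and the particular aggregations (min, mean) are illustrative choices, not necessarily the paper's exact ones:

```python
import math

def tlc_algebraic(token_logprobs, reduce="min"):
    """Aggregate per-token confidences into a caption-level score.

    token_logprobs: one log-probability per caption token, as produced
    by a vision-language model scoring the proposed caption against the
    image. reduce: 'min' or 'mean' (illustrative algebraic aggregations;
    the paper's exact choices may differ).
    """
    confidences = [math.exp(lp) for lp in token_logprobs]
    if reduce == "min":
        return min(confidences)
    if reduce == "mean":
        return sum(confidences) / len(confidences)
    raise ValueError(f"unknown reduce: {reduce}")

# Hypothetical example: caption A has one very low-confidence token,
# caption B is uniformly moderate. A min-aggregation flags A's weak token,
# which a sequence-level (summed) score could average away.
cap_a = [-0.1, -0.2, -2.5]
cap_b = [-0.4, -0.5, -0.6]
score_a = tlc_algebraic(cap_a)
score_b = tlc_algebraic(cap_b)
```

The contrast with sequence-level scores is the point: a single hallucinated object token can be masked when log-probabilities are summed over the whole caption, but it dominates a min- (or otherwise token-sensitive) aggregation.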

Results

Task             | Dataset    | Metric      | Value | Model
-----------------|------------|-------------|-------|-------------------
Visual Reasoning | Winoground | Group Score | 7.25  | OFA large (ITM)
Visual Reasoning | Winoground | Image Score | 10.25 | OFA large (ITM)
Visual Reasoning | Winoground | Text Score  | 30.75 | OFA large (ITM)
Visual Reasoning | Winoground | Group Score | 17.5  | OFA large (TLC-A)
Visual Reasoning | Winoground | Image Score | 27    | OFA large (TLC-A)
Visual Reasoning | Winoground | Text Score  | 29.25 | OFA large (TLC-A)
Visual Reasoning | Winoground | Group Score | 6.5   | OFA base (ITM)
Visual Reasoning | Winoground | Image Score | 10.75 | OFA base (ITM)
Visual Reasoning | Winoground | Text Score  | 26.75 | OFA base (ITM)
Visual Reasoning | Winoground | Group Score | 13.75 | OFA base (TLC-A)
Visual Reasoning | Winoground | Image Score | 23.5  | OFA base (TLC-A)
Visual Reasoning | Winoground | Text Score  | 24.5  | OFA base (TLC-A)
Visual Reasoning | Winoground | Group Score | 4.5   | OFA tiny (ITM)
Visual Reasoning | Winoground | Image Score | 7.75  | OFA tiny (ITM)
Visual Reasoning | Winoground | Text Score  | 22.75 | OFA tiny (ITM)
Visual Reasoning | Winoground | Group Score | 6.75  | OFA tiny (TLC-A)
Visual Reasoning | Winoground | Image Score | 15.75 | OFA tiny (TLC-A)
Visual Reasoning | Winoground | Text Score  | 16.5  | OFA tiny (TLC-A)
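The text, image, and group scores above follow Winoground's standard pair-level protocol: each example has two captions and two images, and a model scores all four caption-image pairings. A minimal sketch of how one example is judged, assuming a generic match-score lookup (the dictionary-based interface here is hypothetical):

```python
def winoground_scores(s):
    """Winoground correctness for a single example.

    s[(c, i)] is the model's image-caption match score for caption c and
    image i (c, i in {0, 1}); caption c is the correct match for image c.
    Dataset-level scores average these booleans over all examples.
    """
    # Text score: each image's correct caption outranks the wrong one.
    text_ok = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    # Image score: each caption's correct image outranks the wrong one.
    image_ok = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    # Group score: both conditions must hold simultaneously.
    return text_ok, image_ok, text_ok and image_ok

# Hypothetical example: the model ranks both correct pairings highest.
scores = {(0, 0): 0.9, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.8}
text_ok, image_ok, group_ok = winoground_scores(scores)
```

Because the group score demands that all four inequalities hold at once, it is the strictest of the three, which is why the group numbers in the table sit well below the text and image numbers for every model.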

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)