Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan YAO, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

Published: 2024-05-27 · CVPR 2025
Tasks: Hallucination, Image Captioning, Visual Question Answering

Abstract

Traditional feedback learning for hallucination reduction relies on labor-intensive manual labeling or expensive proprietary models. This leaves the community without foundational knowledge about how to build high-quality feedback with open-source multimodal large language models (MLLMs). In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm. RLAIF-V maximally exploits open-source MLLMs from two perspectives: high-quality feedback data generation for preference learning, and self-feedback guidance for inference-time scaling. Extensive experiments on six benchmarks, in both automatic and human evaluation, show that RLAIF-V substantially enhances model trustworthiness at both preference learning and inference time. RLAIF-V 7B reduces object hallucination by 80.7% and overall hallucination by 33.7%. Remarkably, RLAIF-V 12B further reveals the self-alignment potential of open-source MLLMs: the model can learn from its own feedback to achieve super GPT-4V trustworthiness.
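The abstract's "self-feedback guidance for inference-time scaling" can be read as a best-of-N selection loop, in which the same model scores its own candidate responses and the highest-scoring one is returned. The sketch below illustrates only that control flow under that assumption; `generate` and `self_score` are hypothetical stand-ins, not part of the RLAIF-V codebase or API.

```python
import random

def generate(prompt, seed):
    """Stand-in for sampling one candidate response from an MLLM.

    Returns a (text, quality) pair; the hidden quality value is a toy
    proxy for how trustworthy the sampled response happens to be.
    """
    random.seed(seed)
    return f"candidate-{seed} for: {prompt}", random.random()

def self_score(prompt, response):
    """Stand-in for the model scoring its own response for trustworthiness
    (e.g. penalizing unsupported claims). Here it just reads the toy
    quality value attached by generate()."""
    _, quality = response
    return quality

def best_of_n(prompt, n=8):
    """Best-of-N inference-time scaling: sample n candidates, keep the one
    the model itself scores highest."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda r: self_score(prompt, r))
```

In a real system, `generate` would be temperature-based sampling from the MLLM and `self_score` a second forward pass in which the model critiques its own output; the selection logic itself stays this simple.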

Results

Task                             Dataset          Metric              Value  Model
Visual Question Answering (VQA)  MMHal-Bench      Hallucination Rate  29.2   RLAIF-V 12B
Visual Question Answering (VQA)  MMHal-Bench      Score               3.36   RLAIF-V 12B
Visual Question Answering (VQA)  MMHal-Bench      Hallucination Rate  29.2   RLAIF-V 7B
Visual Question Answering (VQA)  MMHal-Bench      Score               3.06   RLAIF-V 7B
Visual Question Answering (VQA)  AMBER            Accuracy            88     RLAIF-V 12B
Visual Question Answering (VQA)  AMBER            F1                  90.9   RLAIF-V 12B
Image Captioning                 Object HalBench  CHAIR_i             4.3    RLAIF-V 7B
Image Captioning                 Object HalBench  CHAIR_s             8.5    RLAIF-V 7B
Image Captioning                 Object HalBench  CHAIR_i             1.8    RLAIF-V 12B
Image Captioning                 Object HalBench  CHAIR_s             3.3    RLAIF-V 12B
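The Object HalBench rows above report CHAIR-style scores (chair_i and chair_s): the fraction of mentioned objects that are hallucinated, and the fraction of captions containing at least one hallucinated object. The sketch below shows how such scores are typically computed; the function name and the toy data are illustrative, not taken from the benchmark's own evaluation code.

```python
def chair_metrics(captions):
    """Compute CHAIR-style hallucination scores as percentages.

    captions: list of (mentioned_objects, ground_truth_objects) pairs,
    one per generated caption.
    Returns (chair_i, chair_s):
      chair_i - % of all mentioned objects not present in the image,
      chair_s - % of captions mentioning at least one such object.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in captions:
        truth = set(truth)
        halluc = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(halluc)
        if halluc:
            hallucinated_captions += 1
    chair_i = 100.0 * hallucinated_mentions / max(total_mentions, 1)
    chair_s = 100.0 * hallucinated_captions / max(len(captions), 1)
    return chair_i, chair_s

# Toy example: second caption hallucinates "lamp".
data = [
    (["dog", "frisbee"], ["dog", "frisbee", "grass"]),
    (["cat", "sofa", "lamp"], ["cat", "sofa"]),
]
chair_metrics(data)  # → (20.0, 50.0): 1 of 5 mentions, 1 of 2 captions
```

Lower is better for both scores, which is why the 12B model's 1.8 / 3.3 improves on the 7B model's 4.3 / 8.5.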

Related Papers

Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)