
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Lei LI, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu

2024-10-12 · Hallucination · Models Alignment · Red Teaming · Visual Question Answering

Abstract

As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at https://vlf-silkie.github.io.
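Silkie is trained with direct preference optimization (DPO), which fits the policy directly on the VLFeedback preference pairs rather than training a separate reward model first. As a minimal sketch of what that objective looks like, the standard DPO loss over pre-computed sequence log-probabilities is shown below; the function name, tensor layout, and beta value are illustrative assumptions, not details taken from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss (Rafailov et al., 2023), as a sketch.

    Each argument is a 1-D tensor of summed per-token log-probabilities
    of a full response given the (image, instruction) prompt, under the
    trainable policy or the frozen reference model. `beta` controls how
    strongly the policy is allowed to drift from the reference.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and dispreferred responses.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
batch = [torch.randn(4) for _ in range(4)]
print(float(dpo_loss(*batch)))
```

In practice the four log-probability tensors come from two forward passes per pair (policy and frozen reference) over the chosen and rejected responses from VLFeedback; only the policy receives gradients.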

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 50.7 | Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback))
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 49.9 | Silkie (Qwen-VL-Chat + DPO w/ VLFeedback)
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 44.2 | LLaVA-Next-Mistral-7b (+ DPO w/ VLFeedback)
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 44.1 | LLaVA-Next-Vicuna-7b (+ DPO w/ VLFeedback)

Related Papers

Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
UQLM: A Python Package for Uncertainty Quantification in Large Language Models (2025-07-08)