Daechul Ahn, Yura Choi, Youngjae Yu, Dongyeop Kang, Jonghyun Choi
Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). Previous approaches to building VLMMs involved Supervised Fine-Tuning (SFT) with instruction-tuned datasets, integrating an LLM with visual encoders, and adding extra learnable modules. Aligning the video and text modalities remains challenging, primarily because multimodal instruction-tuning data is far smaller in volume and lower in quality than text-only data. We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself by providing self-preference feedback to refine its own outputs, thereby facilitating the alignment of video and text modalities. Specifically, we propose context-aware reward modeling: detailed video descriptions are supplied as context during the generation of preference feedback, enriching the reward model's understanding of the video content. Our multimodal RLAIF approach, VLM-RLAIF, demonstrates enhanced performance across diverse video benchmarks, outperforming existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area.
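The core mechanism described above is that an AI judge, given a detailed video description as grounding context, scores candidate answers and produces (chosen, rejected) preference pairs for reward modeling. The sketch below illustrates that pipeline shape only; the class and function names, prompt template, and scoring interface are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of context-aware preference feedback for RLAIF.
# All names (Candidate, build_judge_prompt, select_preference) are
# illustrative assumptions; the paper's actual components may differ.
from dataclasses import dataclass

@dataclass
class Candidate:
    response: str
    score: float  # reward assigned to this response by the AI judge

def build_judge_prompt(video_description: str, question: str, response: str) -> str:
    """Embed a detailed video description as context so the AI judge can
    ground its preference feedback in the actual video content."""
    return (
        f"Video context: {video_description}\n"
        f"Question: {question}\n"
        f"Candidate answer: {response}\n"
        "Rate how faithful the answer is to the video context."
    )

def select_preference(candidates: list[Candidate]) -> tuple[Candidate, Candidate]:
    """Turn judge scores into a (chosen, rejected) pair, the training
    signal consumed by preference-based reward modeling."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[0], ranked[-1]
```

In this framing, the video description acts as a proxy for visual grounding: the text-based judge never sees pixels, but a sufficiently detailed description lets it penalize answers that contradict the video.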
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 3.32 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 4.00 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.63 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 3.25 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 3.23 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Mean | 3.49 | VLM-RLAIF |