


Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

Daechul Ahn, Yura Choi, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

Published: 2024-02-06 · Task: Video-based Generative Performance Benchmarking
Links: Paper · PDF · Code (official)

Abstract

Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). Previous approaches to VLMMs involved Supervised Fine-Tuning (SFT) on instruction-tuned datasets, integrating an LLM with visual encoders, and adding extra learnable modules. Aligning the video and text modalities remains challenging, primarily because multimodal instruction-tuning data is scarcer and of lower quality than text-only data. We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself: it provides self-preference feedback to refine its own outputs, facilitating the alignment of the video and text modalities. Specifically, we propose context-aware reward modeling, in which detailed video descriptions are provided as context during the generation of preference feedback, enriching the model's understanding of the video content. Our multimodal RLAIF approach, VLM-RLAIF, demonstrates enhanced performance across diverse video benchmarks and outperforms existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area.
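As a rough sketch of the context-aware reward modeling the abstract describes: an AI judge is conditioned on a detailed video description while ranking candidate answers, and the resulting preference pairs feed the RLAIF stage. Everything below (`describe_video`, `generate_responses`, `judge_with_context`, the data shapes) is a hypothetical placeholder for illustration, not the authors' actual pipeline; see the official code for the real implementation.

```python
# Minimal sketch of context-aware preference labeling for RLAIF.
# All model calls are hypothetical stand-ins, not the paper's code.

from dataclasses import dataclass
import random

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the AI judge preferred
    rejected: str  # response the AI judge rejected

def describe_video(video_path: str) -> str:
    """Placeholder: return a detailed caption of the video.
    In the paper, this description serves as extra context for the judge."""
    return f"[detailed description of {video_path}]"

def generate_responses(prompt: str, n: int = 2) -> list[str]:
    """Placeholder for sampling n candidate answers from the SFT policy."""
    return [f"candidate answer {i} to: {prompt}" for i in range(n)]

def judge_with_context(context: str, prompt: str, a: str, b: str) -> int:
    """Placeholder AI judge: given the video description as context,
    return 0 if response `a` is preferred, 1 if `b` is preferred.
    A real judge would query a multimodal LLM; here we pick randomly."""
    return random.randint(0, 1)

def build_preference_data(videos, prompts) -> list[PreferencePair]:
    pairs = []
    for video, prompt in zip(videos, prompts):
        context = describe_video(video)  # the context-aware step
        a, b = generate_responses(prompt)
        winner = judge_with_context(context, prompt, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs

if __name__ == "__main__":
    data = build_preference_data(["clip_001.mp4"], ["What happens in the video?"])
    print(data[0])
```

The design choice this illustrates is that the judge sees the video description as context when producing preference feedback, which the abstract argues enriches its understanding of the video content; the preference pairs would then train a reward model or directly drive the RLAIF optimization of the policy.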

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 3.32 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 4.00 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.63 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 3.25 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 3.23 | VLM-RLAIF |
| Visual Question Answering (VQA) | VideoInstruct | Mean | 3.49 | VLM-RLAIF |
| Generative Visual Question Answering | VideoInstruct | Consistency | 3.32 | VLM-RLAIF |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 4.00 | VLM-RLAIF |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.63 | VLM-RLAIF |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 3.25 | VLM-RLAIF |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 3.23 | VLM-RLAIF |
| Generative Visual Question Answering | VideoInstruct | Mean | 3.49 | VLM-RLAIF |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 3.32 | VLM-RLAIF |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 4.00 | VLM-RLAIF |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.63 | VLM-RLAIF |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 3.25 | VLM-RLAIF |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 3.23 | VLM-RLAIF |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 3.49 | VLM-RLAIF |
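The reported mean is the unweighted average of the five per-axis scores; a quick arithmetic check against the table above (no assumptions beyond the reported values):

```python
# Unweighted mean of the five VideoInstruct axis scores reported above.
scores = {
    "Correctness of Information": 3.63,
    "Detail Orientation": 3.25,
    "Contextual Understanding": 4.00,
    "Temporal Understanding": 3.23,
    "Consistency": 3.32,
}
mean = sum(scores.values()) / len(scores)
print(round(mean, 2))  # 3.49, matching the reported mean
```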

Related Papers

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models (2024-11-17)
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance (2024-11-04)
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (2024-07-22)
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding (2024-06-13)
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (2024-04-25)
ST-LLM: Large Language Models Are Effective Temporal Learners (2024-03-30)
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM (2024-03-27)
LITA: Language Instructed Temporal-Localization Assistant (2024-03-27)