Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, Rongrong Ji

Published: 2024-11-20
Tasks: Zero-Shot Video Question Answer · Video Retrieval · Video Understanding · Retrieval · Object Detection · RAG
Links: Paper · PDF · Code (official)

Abstract

Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
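The pipeline the abstract describes can be sketched in a few lines: extract visually-aligned auxiliary texts from the raw video with external tools, retrieve the ones relevant to the query in a single turn, and hand them to the LVLM alongside the frames and the question. The sketch below is illustrative only; the three extractor functions are hypothetical placeholders for the open-source tools the paper alludes to (e.g. ASR for audio, OCR for on-screen text, an object detector), returning canned strings so the assembly logic is runnable.

```python
def transcribe_audio(video_path):
    # Placeholder for an ASR tool run on the video's audio track.
    return "speaker: the robot picks up the red cube"

def run_ocr(video_path):
    # Placeholder for OCR applied to sampled frames.
    return "on-screen text: 'Step 3: assembly'"

def detect_objects(video_path):
    # Placeholder for an object detector applied to sampled frames.
    return "objects: robot arm, red cube, table"

def retrieve_relevant(aux_texts, query):
    # Single-turn retrieval: keep only auxiliary texts that share at
    # least one word with the query (a crude stand-in for the paper's
    # retrieval step).
    query_words = set(query.lower().replace("?", "").split())
    return [t for t in aux_texts if query_words & set(t.lower().split())]

def build_rag_prompt(video_path, query):
    # Gather auxiliary texts once, filter them against the query, and
    # prepend the survivors to the question for the LVLM (the frames
    # themselves would be passed to the model separately).
    aux_texts = [
        transcribe_audio(video_path),
        run_ocr(video_path),
        detect_objects(video_path),
    ]
    relevant = retrieve_relevant(aux_texts, query)
    return "Auxiliary context:\n" + "\n".join(relevant) + f"\n\nQuestion: {query}"

prompt = build_rag_prompt("demo.mp4", "What does the robot pick up?")
print(prompt.splitlines()[0])  # prints "Auxiliary context:"
```

Because the auxiliary information arrives as plain text in the prompt, this assembly step is model-agnostic, which is what makes the pipeline plug-and-play with any LVLM.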

Results

Task                      Dataset                    Metric        Value  Model
Video Question Answering  Video-MME                  Accuracy (%)  77.4   Video-RAG (based on LLaVA-Video)
Video Question Answering  Video-MME (w/o subs)       Accuracy (%)  77.4   Video-RAG (based on LLaVA-Video)
Video Question Answering  LongVideoBench (zero-shot) Accuracy (%)  65.4   Video-RAG (based on LLaVA-Video)
Video Question Answering  EgoSchema (fullset)        Accuracy (%)  66.7   Video-RAG (based on LLaVA-Video)

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
- Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)