Video-RAG (based on LLaVA-Video)

Reported on 4 benchmarks across 2 tasks · 1 paper · 2 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Natural Language Processing · 2 results

Question Answering on Video-MME (w/o subs)
Accuracy (%) · 2024-11-20
77.4
SOTA
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension arXiv:2411.13093
Question Answering on Zero-shot Video Question Answering on LongVideoBench
Accuracy (%) · uses extra data · 2024-11-20
65.4
best: 66.7 (Gemini 1.5 Pro)
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension arXiv:2411.13093

Reasoning · 2 results

Video Question Answering on Video-MME (w/o subs)
Accuracy (%) · 2024-11-20
77.4
SOTA
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension arXiv:2411.13093
Video Question Answering on Zero-shot Video Question Answering on LongVideoBench
Accuracy (%) · uses extra data · 2024-11-20
65.4
best: 66.7 (Gemini 1.5 Pro)
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension arXiv:2411.13093