SeViLA

Reported on 4 benchmarks across 2 tasks · 1 paper · 2 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Reasoning · 3 results

Video Question Answering on STAR Benchmark
Average Accuracy · 2023-05-11
64.9
best: 67.1 (VLAP (4 frames))
SOTA
Self-Chained Image-Language Model for Video Localization and Question Answering arXiv:2305.06988
Video Question Answering on NExT-QA
Accuracy · 2023-05-11
73.8
best: 85.5 (LinVT-Qwen2-VL (7B))
SOTA
Self-Chained Image-Language Model for Video Localization and Question Answering arXiv:2305.06988
Video Question Answering on How2QA
Accuracy
83.7
best: 93.2 (Text + Text (no Multimodal Pretext Training))

Methodology · 1 result

Zero-Shot Learning on How2QA
Accuracy
72.3