Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-LLaVA

Reported on 9 benchmarks across 5 tasks · 2 papers

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Reasoning · 6 results

Emotion Interpretation on EIBench (complex)
Recall · 2025-04-10
30.9
best: 39.27 (ChatGPT-4o)
Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models arXiv:2504.07521
Emotion Interpretation on EIBench
Recall · 2025-04-10
49.26
best: 63.24 (Claude-3-haiku)
Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models arXiv:2504.07521
Video Question Answering on ActivityNet-QA
Accuracy · 2023-11-16
45.3
best: 61.6 (Tarsier (34B))
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122
Video Question Answering on ActivityNet-QA
Confidence score · 2023-11-16
3.3
best: 2.2 (Video Chat)
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122
Video Question Answering on ActivityNet-QA
Accuracy · 2023-11-16
45.3
best: 61.6 (Tarsier (34B))
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122
Video Question Answering on ActivityNet-QA
Confidence Score · 2023-11-16
3.3
best: 1.1 (Video LLaMA)
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122

Natural Language Processing · 4 results

Question Answering on ActivityNet-QA
Accuracy · 2023-11-16
45.3
best: 61.6 (Tarsier (34B))
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122
Question Answering on ActivityNet-QA
Confidence Score · 2023-11-16
3.3
best: 1.1 (Video LLaMA)
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122
Visual Question Answering (VQA) on MM-Vet
GPT-4 score · 2023-11-16
32
best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122
Visual Question Answering on MM-Vet
GPT-4 score · 2023-11-16
32
best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection arXiv:2311.10122