Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-LaVIT

Reported on 26 benchmarks across 7 tasks · 1 paper · 3 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
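
As a rough illustration of what exact-name matching implies, the sketch below groups result rows by the literal model string. The row field names and the lowercase `video-lavit` variant are hypothetical and are not taken from the PWC Archive; only the numeric values are borrowed from this page for the sake of the example.

```python
from collections import defaultdict

# Hypothetical result rows; field names are assumptions, not the archive's schema.
rows = [
    {"model": "Video-LaVIT", "benchmark": "ActivityNet-QA", "metric": "Accuracy", "value": 50.1},
    {"model": "Video-LaVIT", "benchmark": "MSVD-QA", "metric": "Accuracy", "value": 73.2},
    # Hypothetical casing variant, included only to show that it would not be merged.
    {"model": "video-lavit", "benchmark": "MSRVTT-QA", "metric": "Accuracy", "value": 59.3},
]


def group_by_exact_name(rows):
    """Group rows by the literal model string, with no case folding or normalization."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["model"]].append(row)
    return dict(groups)


groups = group_by_exact_name(rows)
for name, matched in groups.items():
    print(name, len(matched))
```

Under this kind of matching, a differently spelled name lands in its own group, while two distinct variants that share one name string cannot be told apart.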

Natural Language Processing · 18 results

Question Answering on ActivityNet-QA
Accuracy · 2024-02-05
50.1
best: 61.6 (Tarsier (34B))
SOTA
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Visual Question Answering (VQA) on MMBench
GPT-3.5 score · 2024-02-05
67.3
best: 73.8 (LLaVA-InternLM2-ViT + MoSLoRA)
SOTA
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Visual Question Answering on MMBench
GPT-3.5 score · 2024-02-05
67.3
best: 73.8 (LLaVA-InternLM2-ViT + MoSLoRA)
SOTA
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Question Answering on MSVD-QA
Accuracy · 2024-02-05
73.2
best: 80.3 (Tarsier (34B))
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Question Answering on MSVD-QA
Confidence Score · 2024-02-05
3.9
best: 2.5 (Video LLaMA-7B)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Question Answering on MSRVTT-QA
Accuracy · 2024-02-05
59.3
best: 72.4 (Flash-VStream)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Question Answering on MSRVTT-QA
Confidence Score · 2024-02-05
3.3
best: 1.8 (Video LLaMA-7B)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Question Answering on ActivityNet-QA
Confidence Score · 2024-02-05
3.3
best: 1.1 (Video LLaMA)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Question Answering on ScienceQA
Avg. Accuracy · 2024-02-05
70
best: 94.88 (MC-CoT F-Large)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Visual Question Answering (VQA) on VizWiz 2020 VQA
overall · 2024-02-05
56
best: 73.3 (PaLI)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Visual Question Answering (VQA) on GQA test-dev
Accuracy · 2024-02-05
64.4
best: 72.1 (CFR)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Visual Question Answering (VQA) on MM-Vet
GPT-4 score · 2024-02-05
33.2
best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video Generation on UCF-101
FVD16 · 2024-02-05
280.57
best: 2460 (MCVD)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video Generation on UCF-101
Inception Score · 2024-02-05
44.26
best: 87.68 (HPDM-L)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Text-to-Video Generation on MSR-VTT
CLIPSIM · 2024-02-05
0.3012
best: 0.3125 (PixelDance)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Text-to-Video Generation on MSR-VTT
FID · 2024-02-05
11.27
best: 8.19 (TF-T2V)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Text-to-Video Generation on MSR-VTT
FVD · 2024-02-05
188.36
best: 998 (MagicVideo)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Visual Question Answering on MM-Vet
GPT-4 score · 2024-02-05
33.2
best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161

Reasoning · 6 results

Video Question Answering on MSVD-QA
Accuracy · 2024-02-05
73.2
best: 80.3 (Tarsier (34B))
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video Question Answering on MSVD-QA
Confidence Score · 2024-02-05
3.9
best: 2.5 (Video LLaMA-7B)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video Question Answering on MSRVTT-QA
Accuracy · 2024-02-05
59.3
best: 72.4 (Flash-VStream)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video Question Answering on MSRVTT-QA
Confidence Score · 2024-02-05
3.3
best: 1.8 (Video LLaMA-7B)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video Question Answering on ActivityNet-QA
Accuracy · 2024-02-05
50.1
best: 61.6 (Tarsier (34B))
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video Question Answering on ActivityNet-QA
Confidence Score · 2024-02-05
3.3
best: 1.1 (Video LLaMA)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161

Computer Vision · 2 results

Video on UCF-101
FVD16 · 2024-02-05
280.57
best: 2460 (MCVD)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161
Video on UCF-101
Inception Score · 2024-02-05
44.26
best: 87.68 (HPDM-L)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization arXiv:2402.03161