Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-LaVIT

Reported on 26 benchmarks across 7 tasks · 1 paper · 3 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Natural Language Processing · 18 results

  • Question Answering on ActivityNet-QA
    Accuracy · 2024-02-05
    50.1
    best: 61.6 (Tarsier (34B))
    SOTA
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Visual Question Answering (VQA) on MMBench
    GPT-3.5 score · 2024-02-05
    67.3
    best: 73.8 (LLaVA-InternLM2-ViT + MoSLoRA)
    SOTA
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Visual Question Answering on MMBench
    GPT-3.5 score · 2024-02-05
    67.3
    best: 73.8 (LLaVA-InternLM2-ViT + MoSLoRA)
    SOTA
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Question Answering on MSVD-QA
    Accuracy · 2024-02-05
    73.2
    best: 80.3 (Tarsier (34B))
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Question Answering on MSVD-QA
    Confidence Score · 2024-02-05
    3.9
    best: 2.5 (Video LLaMA-7B)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Question Answering on MSRVTT-QA
    Accuracy · 2024-02-05
    59.3
    best: 72.4 (Flash-VStream)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Question Answering on MSRVTT-QA
    Confidence Score · 2024-02-05
    3.3
    best: 1.8 (Video LLaMA-7B)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Question Answering on ActivityNet-QA
    Confidence Score · 2024-02-05
    3.3
    best: 1.1 (Video LLaMA)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Question Answering on ScienceQA
    Avg. Accuracy · 2024-02-05
    70
    best: 94.88 (MC-CoT F-Large)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Visual Question Answering (VQA) on VizWiz 2020 VQA
    overall · 2024-02-05
    56
    best: 73.3 (PaLI)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Visual Question Answering (VQA) on GQA test-dev
    Accuracy · 2024-02-05
    64.4
    best: 72.1 (CFR)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Visual Question Answering (VQA) on MM-Vet
    GPT-4 score · 2024-02-05
    33.2
    best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video Generation on UCF-101
    FVD16 · 2024-02-05
    280.57
    best: 2460 (MCVD)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video Generation on UCF-101
    Inception Score · 2024-02-05
    44.26
    best: 87.68 (HPDM-L)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Text-to-Video Generation on MSR-VTT
    CLIPSIM · 2024-02-05
    0.3012
    best: 0.3125 (PixelDance)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Text-to-Video Generation on MSR-VTT
    FID · 2024-02-05
    11.27
    best: 8.19 (TF-T2V)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Text-to-Video Generation on MSR-VTT
    FVD · 2024-02-05
    188.36
    best: 998 (MagicVideo)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Visual Question Answering on MM-Vet
    GPT-4 score · 2024-02-05
    33.2
    best: 74.24 (MMCTAgent (GPT-4 + GPT-4V))
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)

Reasoning · 6 results

  • Video Question Answering on MSVD-QA
    Accuracy · 2024-02-05
    73.2
    best: 80.3 (Tarsier (34B))
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video Question Answering on MSVD-QA
    Confidence Score · 2024-02-05
    3.9
    best: 2.5 (Video LLaMA-7B)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video Question Answering on MSRVTT-QA
    Accuracy · 2024-02-05
    59.3
    best: 72.4 (Flash-VStream)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video Question Answering on MSRVTT-QA
    Confidence Score · 2024-02-05
    3.3
    best: 1.8 (Video LLaMA-7B)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video Question Answering on ActivityNet-QA
    Accuracy · 2024-02-05
    50.1
    best: 61.6 (Tarsier (34B))
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video Question Answering on ActivityNet-QA
    Confidence Score · 2024-02-05
    3.3
    best: 1.1 (Video LLaMA)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)

Computer Vision · 2 results

  • Video on UCF-101
    FVD16 · 2024-02-05
    280.57
    best: 2460 (MCVD)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)
  • Video on UCF-101
    Inception Score · 2024-02-05
    44.26
    best: 87.68 (HPDM-L)
    Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (arXiv:2402.03161)