Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Question Answering (VQA) on MSVD-QA

Metric: Accuracy (higher is better)

Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | VLAB | 0.61 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 2 | MA-LMM | 0.606 | No | MA-LMM: Memory-Augmented Large Multimodal Model ... | 2024-04-08 | Code |
| 3 | MaMMUT (ours) | 0.602 | Yes | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 4 | VALOR | 0.6 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 5 | VAST | 0.6 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 6 | COSA | 0.6 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 7 | mPLUG-2 | 0.581 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 8 | VideoCoCa | 0.569 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 9 | GIT | 0.568 | Yes | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 10 | FrozenBiLM+ | 0.558 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 11 | HiTeA | 0.556 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 12 | InternVideo | 0.555 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 13 | UMT-L (ViT-L/16) | 0.552 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 14 | vid-TLDR (UMT-L) | 0.549 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 15 | FrozenBiLM | 0.548 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 16 | VIOLETv2 | 0.547 | Yes | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 17 | MuLTI | 0.547 | Yes | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 18 | X2-VLM (large) | 0.546 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 19 | X2-VLM (base) | 0.528 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 20 | Clover | 0.524 | Yes | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 21 | VIOLET + MELTR | 0.517 | Yes | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 22 | OmniVL | 0.51 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 23 | VIOLET+ | 0.495 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 24 | Co-Tokenization | 0.486 | Yes | Video Question Answering with Iterative Video-Te... | 2022-08-01 | - |
| 25 | All-in-one-B | 0.483 | Yes | All in One: Exploring Unified Video-Language Pre... | 2022-03-14 | Code |
| 26 | LRCE | 0.478 | No | - | - | Code |
| 27 | JustAsk+ | 0.477 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 28 | GIT+MDF | 0.469 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 29 | AIO+MIF | 0.467 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 30 | Just Ask | 0.463 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 31 | ALPRO | 0.459 | Yes | Align and Prompt: Video-and-Language Pre-trainin... | 2021-12-17 | Code |
| 32 | All-in-one+ | 0.438 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 33 | DualVGR | 0.39 | No | DualVGR: A Dual-Visual Graph Reasoning Unit for ... | 2021-07-10 | Code |
| 34 | HCRN | 0.361 | No | Hierarchical Conditional Relation Networks for V... | 2020-02-25 | Code |
| 35 | SSML | 0.351 | No | Noise Estimation Using Density Estimation for Se... | 2020-03-06 | Code |
| 36 | HMEMA | 0.337 | No | Heterogeneous Memory Enhanced Multimodal Attenti... | 2019-04-08 | Code |
| 37 | Co-Mem | 0.317 | No | Motion-Appearance Co-Memory Networks for Video Q... | 2018-03-29 | - |
| 38 | ST-VQA | 0.313 | No | TGIF-QA: Toward Spatio-Temporal Reasoning in Vis... | 2017-04-14 | Code |