Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Question Answering (VQA) on MSRVTT-QA

Metric: Accuracy (higher is better)


Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | VLAB | 0.496 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 2 | MaMMUT | 0.495 | Yes | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 3 | mPLUG-2 | 0.48 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 4 | MuLTI | 0.478 | Yes | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 5 | Flamingo | 0.474 | Yes | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 6 | InternVideo | 0.471 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 7 | UMT-L (ViT-L/16) | 0.471 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 8 | FrozenBiLM+ | 0.47 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 9 | vid-TLDR (UMT-L) | 0.47 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 10 | FrozenBiLM | 0.47 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 11 | VideoCoCa | 0.463 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 12 | HBI | 0.462 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 13 | HiTeA | 0.459 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 14 | EMCL-Net | 0.458 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 15 | Co-Tokenization | 0.457 | Yes | Video Question Answering with Iterative Video-Te... | 2022-08-01 | - |
| 16 | X2-VLM (large) | 0.455 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 17 | X2-VLM (base) | 0.45 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 18 | All-in-one-B | 0.443 | Yes | All in One: Exploring Unified Video-Language Pre... | 2022-03-14 | Code |
| 19 | OmniVL | 0.441 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 20 | Clover | 0.441 | Yes | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 21 | AIO+MIF | 0.44 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 22 | AIO+MDF | 0.438 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 23 | GIT+MDF | 0.423 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 24 | ALPRO | 0.421 | Yes | Align and Prompt: Video-and-Language Pre-trainin... | 2021-12-17 | Code |
| 25 | LRCE | 0.42 | No | - | - | Code |
| 26 | JustAsk+ | 0.418 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 27 | Just Ask | 0.415 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 28 | All-in-one+ | 0.395 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 29 | CLIPBERT | 0.374 | Yes | Less is More: ClipBERT for Video-and-Language Le... | 2021-02-11 | Code |
| 30 | HCRN | 0.356 | No | Hierarchical Conditional Relation Networks for V... | 2020-02-25 | Code |
| 31 | DualVGR | 0.355 | No | DualVGR: A Dual-Visual Graph Reasoning Unit for ... | 2021-07-10 | Code |
| 32 | SSML | 0.35 | No | Noise Estimation Using Density Estimation for Se... | 2020-03-06 | Code |
| 33 | HMEMA | 0.33 | No | Heterogeneous Memory Enhanced Multimodal Attenti... | 2019-04-08 | Code |
| 34 | Co-Mem | 0.32 | No | Motion-Appearance Co-Memory Networks for Video Q... | 2018-03-29 | - |
| 35 | Flamingo (32-shot) | 0.31 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 36 | ST-VQA | 0.309 | No | TGIF-QA: Toward Spatio-Temporal Reasoning in Vis... | 2017-04-14 | Code |
| 37 | Flamingo (0-shot) | 0.174 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |