Visual Question Answering (VQA) on MSRVTT-QA
Metric: Accuracy (higher is better)
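As a rough illustration of the metric, accuracy is the fraction of questions whose predicted answer matches the reference answer. The sketch below assumes exact string matching after lowercasing and trimming whitespace. This page does not state the evaluation protocol, so that normalization, the function name `accuracy`, and the sample answers are all hypothetical.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer.

    Assumption: answers are compared case-insensitively after trimming
    whitespace. The actual benchmark protocol may normalize differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical answers, for illustration only (not drawn from MSRVTT-QA).
preds = ["dog", "two", "guitar", "kitchen"]
refs = ["dog", "three", "guitar", "kitchen"]
print(accuracy(preds, refs))  # 3 of 4 answers match
```

Under this reading, a score of 0.496 means that just under half of the test questions were answered correctly.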
Results
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | VLAB | 0.496 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 2 | MaMMUT | 0.495 | Yes | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 3 | mPLUG-2 | 0.48 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 4 | MuLTI | 0.478 | Yes | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 5 | Flamingo | 0.474 | Yes | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 6 | InternVideo | 0.471 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 7 | UMT-L (ViT-L/16) | 0.471 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 8 | FrozenBiLM+ | 0.47 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 9 | vid-TLDR (UMT-L) | 0.47 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 10 | FrozenBiLM | 0.47 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 11 | VideoCoCa | 0.463 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 12 | HBI | 0.462 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 13 | HiTeA | 0.459 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 14 | EMCL-Net | 0.458 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 15 | Co-Tokenization | 0.457 | Yes | Video Question Answering with Iterative Video-Te... | 2022-08-01 | - |
| 16 | X2-VLM (large) | 0.455 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 17 | X2-VLM (base) | 0.45 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 18 | All-in-one-B | 0.443 | Yes | All in One: Exploring Unified Video-Language Pre... | 2022-03-14 | Code |
| 19 | OmniVL | 0.441 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 20 | Clover | 0.441 | Yes | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 21 | AIO+MIF | 0.44 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 22 | AIO+MDF | 0.438 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 23 | GIT+MDF | 0.423 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 24 | ALPRO | 0.421 | Yes | Align and Prompt: Video-and-Language Pre-trainin... | 2021-12-17 | Code |
| 25 | LRCE | 0.42 | No | - | - | Code |
| 26 | JustAsk+ | 0.418 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 27 | Just Ask | 0.415 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 28 | All-in-one+ | 0.395 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 29 | CLIPBERT | 0.374 | Yes | Less is More: ClipBERT for Video-and-Language Le... | 2021-02-11 | Code |
| 30 | HCRN | 0.356 | No | Hierarchical Conditional Relation Networks for V... | 2020-02-25 | Code |
| 31 | DualVGR | 0.355 | No | DualVGR: A Dual-Visual Graph Reasoning Unit for ... | 2021-07-10 | Code |
| 32 | SSML | 0.35 | No | Noise Estimation Using Density Estimation for Se... | 2020-03-06 | Code |
| 33 | HMEMA | 0.33 | No | Heterogeneous Memory Enhanced Multimodal Attenti... | 2019-04-08 | Code |
| 34 | Co-Mem | 0.32 | No | Motion-Appearance Co-Memory Networks for Video Q... | 2018-03-29 | - |
| 35 | Flamingo (32-shot) | 0.31 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 36 | ST-VQA | 0.309 | No | TGIF-QA: Toward Spatio-Temporal Reasoning in Vis... | 2017-04-14 | Code |
| 37 | Flamingo (0-shot) | 0.174 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |