Visual Question Answering (VQA) on MSRVTT-QA
Metric: Accuracy (higher is better)
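As a rough illustration of the metric, accuracy can be read as the fraction of questions answered correctly, which is why the scores below fall between 0 and 1. The sketch below assumes case-insensitive exact matching between predicted and reference answers; that matching rule, the function name `accuracy`, and the toy data are assumptions for illustration, not the official evaluation protocol of any listed paper.

```python
def accuracy(predictions, answers):
    """Fraction of questions whose predicted answer matches the reference.

    Assumption: a prediction counts as correct on a case-insensitive
    exact match after stripping surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Toy data for illustration only, not drawn from MSRVTT-QA.
preds = ["dog", "two", "guitar", "man"]
refs = ["dog", "three", "guitar", "woman"]
print(accuracy(preds, refs))  # 0.5
```

Under this reading, a score of 0.496 would mean that roughly 49.6% of the evaluated questions were answered correctly.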
Results
| # | Model | Accuracy | Extra Data | SOTA | Paper | Date | Code |
|---|-------|----------|------------|------|-------|------|------|
| 1 | VLAB | 0.496 | Yes | SOTA | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | 2023-05-22 | - |
| 2 | MaMMUT | 0.495 | Yes | SOTA | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | 2023-03-29 | Code |
| 3 | mPLUG-2 | 0.48 | Yes | SOTA | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 2023-02-01 | Code |
| 4 | MuLTI | 0.478 | Yes | - | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | 2023-03-10 | - |
| 5 | Flamingo | 0.474 | Yes | SOTA | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Code |
| 6 | InternVideo | 0.471 | Yes | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 2022-12-06 | Code |
| 7 | UMT-L (ViT-L/16) | 0.471 | Yes | - | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 | Code |
| 8 | FrozenBiLM+ | 0.47 | No | - | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 9 | vid-TLDR (UMT-L) | 0.47 | Yes | - | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | 2024-03-20 | Code |
| 10 | FrozenBiLM | 0.47 | No | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |
| 11 | VideoCoCa | 0.463 | Yes | - | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - |
| 12 | HBI | 0.462 | No | - | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | 2023-03-25 | Code |
| 13 | HiTeA | 0.459 | Yes | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | 2022-12-30 | - |
| 14 | EMCL-Net | 0.458 | No | - | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | 2022-11-21 | Code |
| 15 | Co-Tokenization | 0.457 | Yes | - | Video Question Answering with Iterative Video-Text Co-Tokenization | 2022-08-01 | - |
| 16 | X2-VLM (large) | 0.455 | Yes | - | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | 2022-11-22 | Code |
| 17 | X2-VLM (base) | 0.45 | Yes | - | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | 2022-11-22 | Code |
| 18 | All-in-one-B | 0.443 | Yes | SOTA | All in One: Exploring Unified Video-Language Pre-training | 2022-03-14 | Code |
| 19 | OmniVL | 0.441 | Yes | - | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | 2022-09-15 | - |
| 20 | Clover | 0.441 | Yes | - | Clover: Towards A Unified Video-Language Alignment and Fusion Model | 2022-07-16 | Code |
| 21 | AIO+MIF | 0.44 | No | - | Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | 2023-07-09 | Code |
| 22 | AIO+MDF | 0.438 | No | - | Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | 2023-07-09 | Code |
| 23 | GIT+MDF | 0.423 | No | - | Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | 2023-07-09 | Code |
| 24 | ALPRO | 0.421 | Yes | SOTA | Align and Prompt: Video-and-Language Pre-training with Entity Prompts | 2021-12-17 | Code |
| 25 | LRCE | 0.42 | No | - | - | - | Code |
| 26 | JustAsk+ | 0.418 | No | - | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 27 | Just Ask | 0.415 | No | SOTA | Just Ask: Learning to Answer Questions from Millions of Narrated Videos | 2020-12-01 | Code |
| 28 | All-in-one+ | 0.395 | No | - | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 29 | CLIPBERT | 0.374 | Yes | - | Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | 2021-02-11 | Code |
| 30 | HCRN | 0.356 | No | SOTA | Hierarchical Conditional Relation Networks for Video Question Answering | 2020-02-25 | Code |
| 31 | DualVGR | 0.355 | No | - | DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | 2021-07-10 | Code |
| 32 | SSML | 0.35 | No | - | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | 2020-03-06 | Code |
| 33 | HMEMA | 0.33 | No | SOTA | Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | 2019-04-08 | Code |
| 34 | Co-Mem | 0.32 | No | SOTA | Motion-Appearance Co-Memory Networks for Video Question Answering | 2018-03-29 | - |
| 35 | Flamingo (32-shot) | 0.31 | No | - | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Code |
| 36 | ST-VQA | 0.309 | No | SOTA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017-04-14 | Code |
| 37 | Flamingo (0-shot) | 0.174 | No | - | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Code |