Visual Question Answering (VQA) on MSVD-QA
Metric: Accuracy (higher is better)
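As a rough illustration of the metric, accuracy can be read as the fraction of questions whose predicted answer matches the ground-truth answer. The sketch below assumes exact string matching after trimming and lowercasing. That matching rule, the function name `accuracy`, and the sample answers are all illustrative assumptions; this page does not specify the evaluation protocol behind the reported scores.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match their reference answer.

    Assumption: answers are compared by exact match after stripping
    whitespace and lowercasing. The leaderboard does not state its rule.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical toy answers, not taken from MSVD-QA.
preds = ["dog", "two", "Running", "car"]
refs = ["dog", "three", "running", "cat"]
score = accuracy(preds, refs)
print(f"{score:.3f}")
```

On this scale a leaderboard value such as 0.61 would correspond to 61% of questions answered correctly.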
Results
| # | Model | Accuracy | Extra Data | SOTA badge | Paper | Date | Code |
|---|-------|----------|------------|------------|-------|------|------|
| 1 | VLAB | 0.61 | Yes | Yes | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | 2023-05-22 | - |
| 2 | MA-LMM | 0.606 | No | No | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | 2024-04-08 | Code |
| 3 | MaMMUT (ours) | 0.602 | Yes | Yes | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | 2023-03-29 | Code |
| 4 | VALOR | 0.6 | Yes | No | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 | Code |
| 5 | VAST | 0.6 | Yes | No | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Code |
| 6 | COSA | 0.6 | Yes | No | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | 2023-06-15 | Code |
| 7 | mPLUG-2 | 0.581 | Yes | Yes | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 2023-02-01 | Code |
| 8 | VideoCoCa | 0.569 | Yes | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - |
| 9 | GIT | 0.568 | Yes | Yes | GIT: A Generative Image-to-text Transformer for Vision and Language | 2022-05-27 | Code |
| 10 | FrozenBiLM+ | 0.558 | No | No | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 11 | HiTeA | 0.556 | Yes | No | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | 2022-12-30 | - |
| 12 | InternVideo | 0.555 | Yes | No | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 2022-12-06 | Code |
| 13 | UMT-L (ViT-L/16) | 0.552 | Yes | No | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 | Code |
| 14 | vid-TLDR (UMT-L) | 0.549 | Yes | No | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | 2024-03-20 | Code |
| 15 | FrozenBiLM | 0.548 | No | No | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |
| 16 | VIOLETv2 | 0.547 | Yes | No | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | 2022-09-04 | Code |
| 17 | MuLTI | 0.547 | Yes | No | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | 2023-03-10 | - |
| 18 | X2-VLM (large) | 0.546 | Yes | No | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | 2022-11-22 | Code |
| 19 | X2-VLM (base) | 0.528 | Yes | No | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | 2022-11-22 | Code |
| 20 | Clover | 0.524 | Yes | No | Clover: Towards A Unified Video-Language Alignment and Fusion Model | 2022-07-16 | Code |
| 21 | VIOLET + MELTR | 0.517 | Yes | No | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | 2023-03-23 | Code |
| 22 | OmniVL | 0.51 | Yes | No | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | 2022-09-15 | - |
| 23 | VIOLET+ | 0.495 | No | No | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 24 | Co-Tokenization | 0.486 | Yes | No | Video Question Answering with Iterative Video-Text Co-Tokenization | 2022-08-01 | - |
| 25 | All-in-one-B | 0.483 | Yes | Yes | All in One: Exploring Unified Video-Language Pre-training | 2022-03-14 | Code |
| 26 | LRCE | 0.478 | No | No | No paper | - | Code |
| 27 | JustAsk+ | 0.477 | No | No | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 28 | GIT+MDF | 0.469 | No | No | Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | 2023-07-09 | Code |
| 29 | AIO+MIF | 0.467 | No | No | Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models | 2023-07-09 | Code |
| 30 | Just Ask | 0.463 | No | Yes | Just Ask: Learning to Answer Questions from Millions of Narrated Videos | 2020-12-01 | Code |
| 31 | ALPRO | 0.459 | Yes | No | Align and Prompt: Video-and-Language Pre-training with Entity Prompts | 2021-12-17 | Code |
| 32 | All-in-one+ | 0.438 | No | No | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 33 | DualVGR | 0.39 | No | No | DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering | 2021-07-10 | Code |
| 34 | HCRN | 0.361 | No | Yes | Hierarchical Conditional Relation Networks for Video Question Answering | 2020-02-25 | Code |
| 35 | SSML | 0.351 | No | No | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning | 2020-03-06 | Code |
| 36 | HMEMA | 0.337 | No | Yes | Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering | 2019-04-08 | Code |
| 37 | Co-Mem | 0.317 | No | Yes | Motion-Appearance Co-Memory Networks for Video Question Answering | 2018-03-29 | - |
| 38 | ST-VQA | 0.313 | No | Yes | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017-04-14 | Code |