Visual Question Answering (VQA) on MSVD-QA
Metric: Accuracy (higher is better)
Results
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | VLAB | 0.61 | Yes | VLAB: Enhancing Video Language Pre-training by F... | 2023-05-22 | - |
| 2 | MA-LMM | 0.606 | No | MA-LMM: Memory-Augmented Large Multimodal Model ... | 2024-04-08 | Code |
| 3 | MaMMUT (ours) | 0.602 | Yes | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 4 | VALOR | 0.6 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 5 | VAST | 0.6 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 6 | COSA | 0.6 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 7 | mPLUG-2 | 0.581 | Yes | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 8 | VideoCoCa | 0.569 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 9 | GIT | 0.568 | Yes | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 10 | FrozenBiLM+ | 0.558 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 11 | HiTeA | 0.556 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 12 | InternVideo | 0.555 | Yes | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 13 | UMT-L (ViT-L/16) | 0.552 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 14 | vid-TLDR (UMT-L) | 0.549 | Yes | vid-TLDR: Training Free Token merging for Light-... | 2024-03-20 | Code |
| 15 | FrozenBiLM | 0.548 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 16 | VIOLETv2 | 0.547 | Yes | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 17 | MuLTI | 0.547 | Yes | MuLTI: Efficient Video-and-Language Understandin... | 2023-03-10 | - |
| 18 | X2-VLM (large) | 0.546 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 19 | X2-VLM (base) | 0.528 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 20 | Clover | 0.524 | Yes | Clover: Towards A Unified Video-Language Alignme... | 2022-07-16 | Code |
| 21 | VIOLET + MELTR | 0.517 | Yes | MELTR: Meta Loss Transformer for Learning to Fin... | 2023-03-23 | Code |
| 22 | OmniVL | 0.51 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 23 | VIOLET+ | 0.495 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 24 | Co-Tokenization | 0.486 | Yes | Video Question Answering with Iterative Video-Te... | 2022-08-01 | - |
| 25 | All-in-one-B | 0.483 | Yes | All in One: Exploring Unified Video-Language Pre... | 2022-03-14 | Code |
| 26 | LRCE | 0.478 | No | - | - | Code |
| 27 | JustAsk+ | 0.477 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 28 | GIT+MDF | 0.469 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 29 | AIO+MIF | 0.467 | No | Self-Adaptive Sampling for Efficient Video Quest... | 2023-07-09 | Code |
| 30 | Just Ask | 0.463 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 31 | ALPRO | 0.459 | Yes | Align and Prompt: Video-and-Language Pre-trainin... | 2021-12-17 | Code |
| 32 | All-in-one+ | 0.438 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 33 | DualVGR | 0.39 | No | DualVGR: A Dual-Visual Graph Reasoning Unit for ... | 2021-07-10 | Code |
| 34 | HCRN | 0.361 | No | Hierarchical Conditional Relation Networks for V... | 2020-02-25 | Code |
| 35 | SSML | 0.351 | No | Noise Estimation Using Density Estimation for Se... | 2020-03-06 | Code |
| 36 | HMEMA | 0.337 | No | Heterogeneous Memory Enhanced Multimodal Attenti... | 2019-04-08 | Code |
| 37 | Co-Mem | 0.317 | No | Motion-Appearance Co-Memory Networks for Video Q... | 2018-03-29 | - |
| 38 | ST-VQA | 0.313 | No | TGIF-QA: Toward Spatio-Temporal Reasoning in Vis... | 2017-04-14 | Code |