Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Question Answering (VQA) on VQA v2 test-std

Metric: overall (higher is better)
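The page does not define the "overall" metric. Assuming it refers to the standard VQA accuracy reported by the VQA v2 evaluation server, a minimal sketch of the commonly cited simplified form, min(number of humans who gave that answer / 3, 1) averaged over questions, could look like the following. The function names `vqa_accuracy` and `overall_accuracy` are hypothetical, and the official evaluator additionally normalizes answer strings and averages over subsets of the ten human annotators, which this sketch omits.

```python
from collections import Counter


def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified per-question VQA accuracy: min(#matching humans / 3, 1).

    Assumption: answers are compared after lowercasing and stripping
    whitespace only; the official normalization is more involved.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


def overall_accuracy(predictions: list[str], annotations: list[list[str]]) -> float:
    """Mean per-question accuracy over all questions, as a percentage."""
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, annotations)]
    return 100.0 * sum(scores) / len(scores)
```

Under this sketch, an answer given by at least three of the ten annotators scores full credit, so a leaderboard value such as 84.03 would be read as the mean of these per-question scores times 100.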

Results

| # | Model | Overall | Extra Data | Paper | Date | Code |
|---|-------|---------|------------|-------|------|------|
| 1 | BEiT-3 | 84.03 | No | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 2 | mPLUG-Huge | 83.62 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 3 | ONE-PEACE | 82.52 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 4 | OFA | 81.98 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 5 | X2-VLM (large) | 81.8 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 6 | VLMo | 81.3 | No | VLMo: Unified Vision-Language Pre-Training with ... | 2021-11-03 | Code |
| 7 | Florence | 80.36 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 8 | SimVLM | 80.34 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 9 | X2-VLM (base) | 80.2 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 10 | VAST | 80.19 | Yes | - | - | - |
| 11 | VALOR | 78.62 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 12 | Prompt Tuning | 78.53 | No | Prompt Tuning for Generative Multimodal Pretrain... | 2022-08-04 | Code |
| 13 | Prismer | 78.49 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 14 | MSR + MS Cog. Svcs., X10 models | 77.45 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 15 | MSR + MS Cog. Svcs. | 76.63 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 16 | ALBEF (14M) | 76.04 | No | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 17 | BGN, ensemble | 75.92 | No | Bilinear Graph Networks for Visual Question Answ... | 2019-07-23 | - |
| 18 | ERNIE-ViL-single model | 74.93 | No | ERNIE-ViL: Knowledge Enhanced Vision-Language Re... | 2020-06-30 | - |
| 19 | Single, w/o VLP | 74.16 | No | In Defense of Grid Features for Visual Question ... | 2020-01-10 | Code |
| 20 | Single, w/o VLP | 73.86 | No | Deep Multimodal Neural Architecture Search | 2020-04-25 | Code |
| 21 | UNITER (Large) | 73.4 | No | UNITER: UNiversal Image-TExt Representation Lear... | 2019-09-25 | Code |
| 22 | X-101 grid features + MCAN | 72.71 | No | In Defense of Grid Features for Visual Question ... | 2020-01-10 | Code |
| 23 | LXMERT | 72.5 | No | LXMERT: Learning Cross-Modality Encoder Represen... | 2019-08-20 | Code |
| 24 | VL-BERTLARGE | 72.2 | No | VL-BERT: Pre-training of Generic Visual-Linguist... | 2019-08-22 | Code |
| 25 | MCAN+VC | 71.49 | No | Visual Commonsense R-CNN | 2020-02-27 | Code |
| 26 | VisualBERT | 71 | No | VisualBERT: A Simple and Performant Baseline for... | 2019-08-09 | Code |
| 27 | MCANed-6 | 70.9 | No | Deep Modular Co-Attention Networks for Visual Qu... | 2019-06-25 | Code |
| 28 | Unified VLP | 70.7 | No | Unified Vision-Language Pre-Training for Image C... | 2019-09-24 | Code |
| 29 | BAN+Glove+Counter | 70.4 | No | Bilinear Attention Networks | 2018-05-21 | Code |
| 30 | Up-Down | 70.34 | No | Bottom-Up and Top-Down Attention for Image Capti... | 2017-07-25 | Code |
| 31 | Image features from bottom-up attention (adaptive K, ensemble) | 70.3 | No | Tips and Tricks for Visual Question Answering: L... | 2017-08-09 | Code |
| 32 | Caption VQA | 69.7 | No | Generating Question Relevant Captions to Aid Vis... | 2019-06-03 | - |
| 33 | MuRel | 68.4 | No | MUREL: Multimodal Relational Reasoning for Visua... | 2019-02-25 | Code |
| 34 | DMN | 68.4 | No | Learning to Count Objects in Natural Images for ... | 2018-02-15 | Code |
| 35 | BLOCK | 67.9 | No | BLOCK: Bilinear Superdiagonal Fusion for Visual ... | 2019-01-31 | Code |
| 36 | MUTAN | 67.4 | No | MUTAN: Multimodal Tucker Fusion for Visual Quest... | 2017-05-18 | Code |
| 37 | 2D continuous softmax | 66.27 | No | Sparse and Continuous Attention Mechanisms | 2020-06-12 | Code |
| 38 | MCB [11, 12] | 62.27 | No | Making the V in VQA Matter: Elevating the Role o... | 2016-12-02 | Code |
| 39 | Language-only | 44.26 | No | Making the V in VQA Matter: Elevating the Role o... | 2016-12-02 | Code |
| 40 | Prior | 25.98 | No | Making the V in VQA Matter: Elevating the Role o... | 2016-12-02 | Code |