Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Visual Question Answering (VQA) on VQA v2 test-dev

Metric: Accuracy (higher is better)
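The accuracy reported here is the VQA dataset's soft accuracy: each question has 10 human-annotated answers, a prediction counts as fully correct when at least 3 annotators gave it, and the official evaluation averages the score over all leave-one-out subsets of 9 annotators. A minimal sketch (the function name is ours; the official scorer also normalizes answer strings, which is omitted here):

```python
def vqa_accuracy(pred: str, human_answers: list) -> float:
    """Soft VQA accuracy for one question.

    For each of the 10 leave-one-out subsets of 9 annotators,
    score min(#annotators matching pred / 3, 1), then average.
    """
    scores = []
    for i in range(len(human_answers)):
        subset = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == pred for a in subset)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)
```

For example, a prediction matching all 10 annotators scores 1.0, while one matching only 2 of 10 scores 0.6 under the leave-one-out averaging.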


Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | PaLI | 84.3 | No | PaLI: A Jointly-Scaled Multilingual Language-Ima... | 2022-09-14 | Code |
| 2 | BEiT-3 | 84.19 | No | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 3 | VLMo | 82.78 | No | VLMo: Unified Vision-Language Pre-Training with ... | 2021-11-03 | Code |
| 4 | ONE-PEACE | 82.6 | No | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 5 | mPLUG (Huge) | 82.43 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 6 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | 82.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 7 | CoCa | 82.3 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 8 | CuMo-7B | 82.2 | Yes | CuMo: Scaling Multimodal LLM with Co-Upcycled Mi... | 2024-05-09 | Code |
| 9 | OFA | 82 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 10 | X2-VLM (large) | 81.9 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 11 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | 81.74 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 12 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | 81.66 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 13 | MMU | 81.26 | No | Achieving Human Parity on Visual Question Answer... | 2021-11-17 | - |
| 14 | Lyrics | 81.2 | No | Lyrics: Boosting Fine-grained Language-Vision Al... | 2023-12-08 | - |
| 15 | InternVL-C | 81.2 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 16 | mPLUG-2 | 81.11 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 17 | X2-VLM (base) | 80.4 | No | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 18 | XFM (base) | 80.4 | No | Toward Building General Foundation Models for La... | 2023-01-12 | Code |
| 19 | VAST | 80.23 | Yes | - | - | - |
| 20 | Florence | 80.16 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 21 | SimVLM | 80.03 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 22 | VALOR | 78.46 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 23 | Prismer | 78.43 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 24 | X-VLM (base) | 78.22 | No | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 25 | VK-OOD | 77.9 | No | - | - | Code |
| 26 | Aurora (ours, r=64) | 77.69 | No | - | - | - |
| 27 | VK-OOD | 76.8 | No | Differentiable Outlier Detection Enable Robust D... | 2023-02-11 | Code |
| 28 | ALBEF (14M) | 75.84 | No | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 29 | Oscar | 73.82 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 30 | UNITER (Large) | 73.24 | No | UNITER: UNiversal Image-TExt Representation Lear... | 2019-09-25 | Code |
| 31 | X-101 grid features + MCAN | 72.59 | No | In Defense of Grid Features for Visual Question ... | 2020-01-10 | Code |
| 32 | CFR | 72.5 | No | Coarse-to-Fine Reasoning for Visual Question Ans... | 2021-10-06 | Code |
| 33 | VL-BERT (Large) | 71.79 | No | VL-BERT: Pre-training of Generic Visual-Linguist... | 2019-08-22 | Code |
| 34 | ViLT-B/32 | 71.26 | No | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 35 | MCAN+VC | 71.21 | No | Visual Commonsense R-CNN | 2020-02-27 | Code |
| 36 | VL-BERT (Base) | 71.16 | No | VL-BERT: Pre-training of Generic Visual-Linguist... | 2019-08-22 | Code |
| 37 | VisualBERT | 70.8 | No | VisualBERT: A Simple and Performant Baseline for... | 2019-08-09 | Code |
| 38 | LXMERT (low-magnitude pruning) | 70.72 | No | LXMERT Model Compression for Visual Question Ans... | 2023-10-23 | Code |
| 39 | MCANed-6 | 70.63 | No | Deep Modular Co-Attention Networks for Visual Qu... | 2019-06-25 | Code |
| 40 | ViLBERT | 70.55 | No | ViLBERT: Pretraining Task-Agnostic Visiolinguist... | 2019-08-06 | Code |
| 41 | BAN+Glove+Counter | 70.04 | No | Bilinear Attention Networks | 2018-05-21 | Code |
| 42 | LXMERT (Pre-train + scratch) | 69.9 | No | LXMERT: Learning Cross-Modality Encoder Represen... | 2019-08-20 | Code |
| 43 | Image features from bottom-up attention (adaptive K, ensemble) | 69.87 | No | Tips and Tricks for Visual Question Answering: L... | 2017-08-09 | Code |
| 44 | Pythia v0.3 + LoRRA | 69.21 | No | Towards VQA Models That Can Read | 2019-04-18 | Code |
| 45 | DMN | 68.09 | No | Learning to Count Objects in Natural Images for ... | 2018-02-15 | Code |
| 46 | LaKo | 68.07 | No | LaKo: Knowledge-driven Visual Question Answering... | 2022-07-26 | Code |
| 47 | MuRel | 68.03 | No | MUREL: Multimodal Relational Reasoning for Visua... | 2019-02-25 | Code |
| 48 | BLOCK | 67.58 | No | BLOCK: Bilinear Superdiagonal Fusion for Visual ... | 2019-01-31 | Code |
| 49 | MUTAN | 67.42 | No | MUTAN: Multimodal Tucker Fusion for Visual Quest... | 2017-05-18 | Code |
| 50 | BAN2-CTI | 67.4 | No | Compact Trilinear Interaction for Visual Questio... | 2019-09-26 | Code |
| 51 | 2D continuous softmax | 65.96 | No | Sparse and Continuous Attention Mechanisms | 2020-06-12 | Code |
| 52 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 65 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 53 | N2NMN (ResNet-152, policy search) | 64.9 | No | Learning to Reason: End-to-End Module Networks f... | 2017-04-18 | Code |
| 54 | PNP-VQA | 64.8 | No | Plug-and-Play VQA: Zero-shot VQA by Conjoining L... | 2022-10-17 | Code |
| 55 | MCB | 64.7 | No | Multimodal Compact Bilinear Pooling for Visual Q... | 2016-06-06 | Code |
| 56 | RUBi | 63.18 | No | RUBi: Reducing Unimodal Biases in Visual Questio... | 2019-06-24 | Code |
| 57 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 63 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 58 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 62.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 59 | Flamingo 80B | 56.3 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 60 | LocVLM-L | 56.2 | No | Learning to Localize Objects Improves Spatial Re... | 2024-04-11 | Code |
| 61 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 52.6 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 62 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 52.3 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 63 | Flamingo 9B | 51.8 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 64 | KOSMOS-1 1.6B (zero-shot) | 51 | No | - | - | - |
| 65 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 49.7 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 66 | Flamingo 3B | 49.2 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 67 | VLKD | 44.5 | No | - | - | - |