Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VisualBERT

Reported on 66 benchmarks across 5 tasks · 4 papers · 13 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
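
A minimal sketch of what exact-name matching implies, assuming results are stored as simple rows keyed by the model name string. The row layout and the `group_by_exact_name` helper are hypothetical and not part of the site; the last row is a made-up entry that differs only in casing.

```python
from collections import defaultdict

# Hypothetical rows: (model name, benchmark, metric, value).
# The first three values mirror entries listed on this page;
# the last row is invented purely to show a name mismatch.
ROWS = [
    ("VisualBERT", "VQA v2 test-dev", "Accuracy", 70.8),
    ("VisualBERT", "NLVR2 Dev", "Accuracy", 66.7),
    ("VisualBERT", "VSR", "accuracy", 55.2),
    ("visualbert", "VSR", "accuracy", 0.0),  # made-up row, different casing
]


def group_by_exact_name(rows):
    """Group rows by model name, compared character for character."""
    groups = defaultdict(list)
    for name, benchmark, metric, value in rows:
        groups[name].append((benchmark, metric, value))
    return dict(groups)


groups = group_by_exact_name(ROWS)
print(len(groups["VisualBERT"]))  # rows grouped under the exact name
print(len(groups["visualbert"]))  # the differently cased name stays separate
```

Under this kind of matching, two papers that both report a model called "VisualBERT" end up in the same group even if they evaluated different variants, which is why the listed numbers should be read per paper.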

Natural Language Processing · 38 results

Visual Question Answering (VQA) on VCR (Q-AR) test
Accuracy · 2019-08-09
52.4
best: 81.6 (GPT4RoI)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Visual Question Answering (VQA) on VCR (Q-AR) dev
Accuracy · 2019-08-09
52.2
best: 58.9 (VL-BERTLARGE)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Visual Question Answering (VQA) on VCR (Q-A) dev
Accuracy · 2019-08-09
70.8
best: 75.5 (VL-BERTLARGE)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Visual Question Answering (VQA) on VCR (QA-R) dev
Accuracy · 2019-08-09
73.2
best: 77.9 (VL-BERTLARGE)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Visual Question Answering (VQA) on VCR (QA-R) test
Accuracy · 2019-08-09
73.2
best: 91 (GPT4RoI)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Visual Question Answering (VQA) on VCR (Q-A) test
Accuracy · 2019-08-09
71.6
best: 89.4 (GPT4RoI)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Visual Question Answering (VQA) on VQA v2 test-dev
Accuracy · 2019-08-09
70.8
best: 84.3 (PaLI)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Phrase Grounding on Flickr30k Entities Dev
R@1 · 2019-08-09
70.4
best: 87.1 (Fiber-B)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Phrase Grounding on Flickr30k Entities Dev
R@10 · 2019-08-09
86.31
best: 97.4 (Fiber-B)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Phrase Grounding on Flickr30k Entities Dev
R@5 · 2019-08-09
84.49
best: 96.1 (Fiber-B)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Phrase Grounding on Flickr30k Entities Test
R@10 · 2019-08-09
86.51
best: 98.1 (GLIP)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Phrase Grounding on Flickr30k Entities Test
R@5 · 2019-08-09
84.98
best: 96.9 (GLIP)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
Accuracy (%) · 2021-12-14
46.6
best: 71.5 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
pairwise accuracy · 2021-12-14
48.5
best: 88.8 (CLIP)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE counting adversarial
Accuracy (%) · 2021-12-14
50
best: 66.7 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE counting adversarial
pairwise accuracy · 2021-12-14
50
best: 77.3 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE counting balanced
Accuracy (%) · 2021-12-14
48.3
best: 64.9 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE counting balanced
pairwise accuracy · 2021-12-14
48.2
best: 76.7 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE actant swap
Accuracy (%) · 2021-12-14
49.7
best: 52.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE actant swap
pairwise accuracy · 2021-12-14
44.4
best: 76.9 (GPT2)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE coreference clean
Accuracy (%) · 2021-12-14
50
best: 54.3 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE coreference clean
pairwise accuracy · 2021-12-14
47.6
best: 69.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE counting small numbers
Accuracy (%) · 2021-12-14
47.8
best: 69.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE counting small numbers
pairwise accuracy · 2021-12-14
48.2
best: 80.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE existence
Accuracy (%) · 2021-12-14
49.3
best: 89 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE existence
pairwise accuracy · 2021-12-14
39.7
best: 95.6 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE coreference standard
Accuracy (%) · 2021-12-14
50
best: 54.4 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE coreference standard
pairwise accuracy · 2021-12-14
49.5
best: 75.7 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE spatial relations
Accuracy (%) · 2021-12-14
49.3
best: 53.4 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE spatial relations
pairwise accuracy · 2021-12-14
39.7
best: 77.2 (GPT1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE plurality
Accuracy (%) · 2021-12-14
46.5
best: 62 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE plurality
pairwise accuracy · 2021-12-14
45.7
best: 72.4 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE action replacement
Accuracy (%) · 2021-12-14
48.8
best: 57.3 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE action replacement
pairwise accuracy · 2021-12-14
49.2
best: 75.6 (CLIP)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE
Average Accuracy · 2021-12-14
48.8
best: 63.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Text and Image Classification on VALSE
average pairwise accuracy · 2021-12-14
46.4
best: 75.1 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Visual Question Answering (VQA) on VQA v2 test-std
overall · 2019-08-09
71
best: 84.03 (BEiT-3)
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Phrase Grounding on Flickr30k Entities Test
R@1 · 2019-08-09
71.33
best: 87.7 (GLIPv2)
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
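
The VALSE rows above report both an accuracy and a pairwise accuracy, and the two can diverge. This page does not define either metric, so the following is only a sketch under a common reading: accuracy treats every caption and every foil as a separate yes/no decision, while pairwise accuracy asks whether the correct caption scores higher than its foil for the same image. The 0.5 threshold, the function names, and the scores are all assumptions made up for illustration.

```python
def accuracy(caption_scores, foil_scores, threshold=0.5):
    """Percent of sentences classified correctly.

    A caption counts as correct when its score is at or above the
    threshold; a foil counts as correct when its score is below it.
    """
    correct = sum(s >= threshold for s in caption_scores)
    correct += sum(s < threshold for s in foil_scores)
    return 100.0 * correct / (len(caption_scores) + len(foil_scores))


def pairwise_accuracy(caption_scores, foil_scores):
    """Percent of (caption, foil) pairs where the caption scores higher."""
    pairs = list(zip(caption_scores, foil_scores))
    wins = sum(c > f for c, f in pairs)
    return 100.0 * wins / len(pairs)


# Made-up image-sentence alignment scores for four caption/foil pairs.
# This hypothetical model accepts every sentence, foils included.
caption_scores = [0.9, 0.8, 0.7, 0.6]
foil_scores = [0.95, 0.85, 0.75, 0.55]

print(accuracy(caption_scores, foil_scores))           # 50.0
print(pairwise_accuracy(caption_scores, foil_scores))  # 25.0
```

In this toy case a model that accepts everything lands at 50 on accuracy while its pairwise accuracy drops below 50, the same pattern as several of the rows above.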

Methodology · 24 results

Multimodal Deep Learning on VALSE foil-it (noun phrases)
Accuracy (%) · 2021-12-14
46.6
best: 71.5 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE foil-it (noun phrases)
pairwise accuracy · 2021-12-14
48.5
best: 88.8 (CLIP)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE counting adversarial
Accuracy (%) · 2021-12-14
50
best: 66.7 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE counting adversarial
pairwise accuracy · 2021-12-14
50
best: 77.3 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE counting balanced
Accuracy (%) · 2021-12-14
48.3
best: 64.9 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE counting balanced
pairwise accuracy · 2021-12-14
48.2
best: 76.7 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE actant swap
Accuracy (%) · 2021-12-14
49.7
best: 52.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE actant swap
pairwise accuracy · 2021-12-14
44.4
best: 76.9 (GPT2)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE coreference clean
Accuracy (%) · 2021-12-14
50
best: 54.3 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE coreference clean
pairwise accuracy · 2021-12-14
47.6
best: 69.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE counting small numbers
Accuracy (%) · 2021-12-14
47.8
best: 69.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE counting small numbers
pairwise accuracy · 2021-12-14
48.2
best: 80.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE existence
Accuracy (%) · 2021-12-14
49.3
best: 89 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE existence
pairwise accuracy · 2021-12-14
39.7
best: 95.6 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE coreference standard
Accuracy (%) · 2021-12-14
50
best: 54.4 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE coreference standard
pairwise accuracy · 2021-12-14
49.5
best: 75.7 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE spatial relations
Accuracy (%) · 2021-12-14
49.3
best: 53.4 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE spatial relations
pairwise accuracy · 2021-12-14
39.7
best: 77.2 (GPT1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE plurality
Accuracy (%) · 2021-12-14
46.5
best: 62 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE plurality
pairwise accuracy · 2021-12-14
45.7
best: 72.4 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE action replacement
Accuracy (%) · 2021-12-14
48.8
best: 57.3 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE action replacement
pairwise accuracy · 2021-12-14
49.2
best: 75.6 (CLIP)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE
Average Accuracy · 2021-12-14
48.8
best: 63.2 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566
Multimodal Deep Learning on VALSE
average pairwise accuracy · 2021-12-14
46.4
best: 75.1 (ViLBERT 12-in-1)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena arXiv:2112.07566

Reasoning · 4 results

Visual Reasoning on NLVR2 Dev
Accuracy · 2019-08-09
66.7
best: 91.51 (BEiT-3)
SOTA
VisualBERT: A Simple and Performant Baseline for Vision and Language arXiv:1908.03557
Visual Reasoning on VSR
accuracy · 2022-04-30
55.2
best: 70.1 (LXMERT)
Visual Spatial Reasoning arXiv:2205.00363
Visual Reasoning on GD-VCR
Accuracy · 2021-09-14
53.95
best: 88.84 (Human)
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning arXiv:2109.06860
Visual Reasoning on GD-VCR
Gap (West) · 2021-09-14
-10.42
best: -7.28 (ViLBERT)
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning arXiv:2109.06860