Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VisualBERT

Reported on 66 benchmarks across 5 tasks · 4 papers · 13 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
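A minimal sketch of what exact-name matching implies, assuming results are held as simple tuples. The `ROWS` layout and the `group_by_exact_name` helper are illustrative assumptions, not the site's actual code; the names and scores are copied from entries on this page. Rows reported by different papers under the string "VisualBERT" end up in one group, while "ViLBERT" and "ViLBERT 12-in-1" stay separate because the strings differ.

```python
from collections import defaultdict

# Rows copied from this page: (model name, benchmark, metric, score).
# The tuple layout is an assumption made for this sketch.
ROWS = [
    ("VisualBERT", "VQA v2 test-dev", "Accuracy", 70.8),
    ("VisualBERT", "VALSE", "Average Accuracy", 48.8),
    ("VisualBERT", "VSR", "accuracy", 55.2),
    ("ViLBERT 12-in-1", "VALSE", "Average Accuracy", 63.2),
    ("ViLBERT", "GD-VCR", "Gap (West)", -7.28),
]


def group_by_exact_name(rows):
    """Group rows by the model-name string, compared character for character."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


groups = group_by_exact_name(ROWS)
for name, rows in groups.items():
    print(f"{name}: {len(rows)} row(s)")
```

Under this kind of matching, nothing distinguishes a row for one VisualBERT variant from a row for another, so the source paper listed on each entry is the place to check which variant was evaluated.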

Natural Language Processing · 38 results

  • Visual Question Answering (VQA) on VCR (Q-AR) test
    Accuracy · 2019-08-09
    52.4
    best: 81.6 (GPT4RoI)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Visual Question Answering (VQA) on VCR (Q-AR) dev
    Accuracy · 2019-08-09
    52.2
    best: 58.9 (VL-BERTLARGE)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Visual Question Answering (VQA) on VCR (Q-A) dev
    Accuracy · 2019-08-09
    70.8
    best: 75.5 (VL-BERTLARGE)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Visual Question Answering (VQA) on VCR (QA-R) dev
    Accuracy · 2019-08-09
    73.2
    best: 77.9 (VL-BERTLARGE)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Visual Question Answering (VQA) on VCR (QA-R) test
    Accuracy · 2019-08-09
    73.2
    best: 91 (GPT4RoI)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Visual Question Answering (VQA) on VCR (Q-A) test
    Accuracy · 2019-08-09
    71.6
    best: 89.4 (GPT4RoI)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Visual Question Answering (VQA) on VQA v2 test-dev
    Accuracy · 2019-08-09
    70.8
    best: 84.3 (PaLI)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Phrase Grounding on Flickr30k Entities Dev
    R@1 · 2019-08-09
    70.4
    best: 87.1 (Fiber-B)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Phrase Grounding on Flickr30k Entities Dev
    R@10 · 2019-08-09
    86.31
    best: 97.4 (Fiber-B)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Phrase Grounding on Flickr30k Entities Dev
    R@5 · 2019-08-09
    84.49
    best: 96.1 (Fiber-B)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Phrase Grounding on Flickr30k Entities Test
    R@10 · 2019-08-09
    86.51
    best: 98.1 (GLIP)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Phrase Grounding on Flickr30k Entities Test
    R@5 · 2019-08-09
    84.98
    best: 96.9 (GLIP)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    Accuracy (%) · 2021-12-14
    46.6
    best: 71.5 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    48.5
    best: 88.8 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE counting adversarial
    Accuracy (%) · 2021-12-14
    50
    best: 66.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    50
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE counting balanced
    Accuracy (%) · 2021-12-14
    48.3
    best: 64.9 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    48.2
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE actant swap
    Accuracy (%) · 2021-12-14
    49.7
    best: 52.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE actant swap
    pairwise accuracy · 2021-12-14
    44.4
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE coreference clean
    Accuracy (%) · 2021-12-14
    50
    best: 54.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    47.6
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE counting small numbers
    Accuracy (%) · 2021-12-14
    47.8
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    48.2
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE existence
    Accuracy (%) · 2021-12-14
    49.3
    best: 89 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE existence
    pairwise accuracy · 2021-12-14
    39.7
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE coreference standard
    Accuracy (%) · 2021-12-14
    50
    best: 54.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    49.5
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE spatial relations
    Accuracy (%) · 2021-12-14
    49.3
    best: 53.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    39.7
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE plurality
    Accuracy (%) · 2021-12-14
    46.5
    best: 62 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE plurality
    pairwise accuracy · 2021-12-14
    45.7
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE action replacement
    Accuracy (%) · 2021-12-14
    48.8
    best: 57.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE action replacement
    pairwise accuracy · 2021-12-14
    49.2
    best: 75.6 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE
    Average Accuracy · 2021-12-14
    48.8
    best: 63.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Text and Image Classification on VALSE
    average pairwise accuracy · 2021-12-14
    46.4
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Visual Question Answering (VQA) on VQA v2 test-std
    overall · 2019-08-09
    71
    best: 84.03 (BEiT-3)
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Phrase Grounding on Flickr30k Entities Test
    R@1 · 2019-08-09
    71.33
    best: 87.7 (GLIPv2)
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)

Methodology · 24 results

  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    Accuracy (%) · 2021-12-14
    46.6
    best: 71.5 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    48.5
    best: 88.8 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE counting adversarial
    Accuracy (%) · 2021-12-14
    50
    best: 66.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    50
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE counting balanced
    Accuracy (%) · 2021-12-14
    48.3
    best: 64.9 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    48.2
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE actant swap
    Accuracy (%) · 2021-12-14
    49.7
    best: 52.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE actant swap
    pairwise accuracy · 2021-12-14
    44.4
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE coreference clean
    Accuracy (%) · 2021-12-14
    50
    best: 54.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    47.6
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE counting small numbers
    Accuracy (%) · 2021-12-14
    47.8
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    48.2
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE existence
    Accuracy (%) · 2021-12-14
    49.3
    best: 89 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE existence
    pairwise accuracy · 2021-12-14
    39.7
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE coreference standard
    Accuracy (%) · 2021-12-14
    50
    best: 54.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    49.5
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE spatial relations
    Accuracy (%) · 2021-12-14
    49.3
    best: 53.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    39.7
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE plurality
    Accuracy (%) · 2021-12-14
    46.5
    best: 62 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE plurality
    pairwise accuracy · 2021-12-14
    45.7
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE action replacement
    Accuracy (%) · 2021-12-14
    48.8
    best: 57.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE action replacement
    pairwise accuracy · 2021-12-14
    49.2
    best: 75.6 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE
    Average Accuracy · 2021-12-14
    48.8
    best: 63.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)
  • Multimodal Deep Learning on VALSE
    average pairwise accuracy · 2021-12-14
    46.4
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (arXiv:2112.07566)

Reasoning · 4 results

  • Visual Reasoning on NLVR2 Dev
    Accuracy · 2019-08-09
    66.7
    best: 91.51 (BEiT-3)
    SOTA
    VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv:1908.03557)
  • Visual Reasoning on VSR
    accuracy · 2022-04-30
    55.2
    best: 70.1 (LXMERT)
    Visual Spatial Reasoning (arXiv:2205.00363)
  • Visual Reasoning on GD-VCR
    Accuracy · 2021-09-14
    53.95
    best: 88.84 (Human)
    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (arXiv:2109.06860)
  • Visual Reasoning on GD-VCR
    Gap (West) · 2021-09-14
    -10.42
    best: -7.28 (ViLBERT)
    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning (arXiv:2109.06860)