Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

ViLBERT

Reported on 73 benchmarks across 8 tasks · 5 papers · 2 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
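A minimal sketch of what exact-name matching implies, assuming a hypothetical `Result` record and a hypothetical `results_for_model` helper (neither is the site's actual code). The sample rows reuse figures listed on this page; the point is only that a variant such as "ViLBERT 12-in-1" is a different string and so is not counted under "ViLBERT".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical record for one reported result."""
    model: str
    benchmark: str
    metric: str
    value: float


def results_for_model(results, name):
    """Keep only results whose model name equals `name` exactly (case-sensitive)."""
    return [r for r in results if r.model == name]


rows = [
    Result("ViLBERT", "VALSE", "Average Accuracy", 51.3),
    Result("ViLBERT 12-in-1", "VALSE", "Average Accuracy", 63.2),
    Result("ViLBERT", "VQA v2 test-dev", "Accuracy", 70.55),
]

# Exact matching excludes the "ViLBERT 12-in-1" row.
matched = results_for_model(rows, "ViLBERT")
for r in matched:
    print(r.benchmark, r.metric, r.value)
```

Exact matching cannot tell apart two papers that both report a model simply called "ViLBERT", so rows grouped this way may still describe different checkpoints or training setups.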

Natural Language Processing · 44 results

  • Visual Question Answering (VQA) on A-OKVQA
    DA VQA Score · 2019-08-06
    25.9
    best: 70.55 (SMoLA-PaLI-X Specialist Model)
    SOTA
    ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks · arXiv:1908.02265
  • Inductive knowledge graph completion on MARS (Multimodal Analogical Reasoning dataSet)
    MRR · 2022-10-01
    0.287
    best: 0.341 (MarT_MKGformer)
    Multimodal Analogical Reasoning over Knowledge Graphs · arXiv:2210.00312
  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    Accuracy (%) · 2021-12-14
    55.9
    best: 71.5 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    86.9
    best: 88.8 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting adversarial
    Accuracy (%) · 2021-12-14
    51.8
    best: 66.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    73.7
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting balanced
    Accuracy (%) · 2021-12-14
    50.7
    best: 64.9 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    58.6
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE actant swap
    Accuracy (%) · 2021-12-14
    50.4
    best: 52.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE actant swap
    pairwise accuracy · 2021-12-14
    68.3
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference clean
    Accuracy (%) · 2021-12-14
    50
    best: 54.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    48.1
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting small numbers
    Accuracy (%) · 2021-12-14
    50.6
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    62.9
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE existence
    Accuracy (%) · 2021-12-14
    2.4
    best: 89 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE existence
    pairwise accuracy · 2021-12-14
    66.5
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference standard
    Accuracy (%) · 2021-12-14
    50
    best: 54.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    47.2
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE spatial relations
    Accuracy (%) · 2021-12-14
    49.9
    best: 53.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    57.2
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE plurality
    Accuracy (%) · 2021-12-14
    50.3
    best: 62 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE plurality
    pairwise accuracy · 2021-12-14
    61.2
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE action replacement
    Accuracy (%) · 2021-12-14
    52.6
    best: 57.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE action replacement
    pairwise accuracy · 2021-12-14
    70.7
    best: 75.6 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE
    Average Accuracy · 2021-12-14
    51.3
    best: 63.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE
    average pairwise accuracy · 2021-12-14
    63.7
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Alg.) · 2021-10-25
    50.62
    best: 56.73 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Com.) · 2021-10-25
    75.6
    best: 87 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Cou.) · 2021-10-25
    71.05
    best: 77.81 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Est.) · 2021-10-25
    99.22
    best: 99.54 (Top-Down)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Fra.) · 2021-10-25
    74.09
    best: 82.13 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Geo.) · 2021-10-25
    80.05
    best: 82.61 (ViLT)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Mea.) · 2021-10-25
    99.07
    best: 99.46 (Top-Down)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Pat.) · 2021-10-25
    62.78
    best: 68.75 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Pro.) · 2021-10-25
    70.94
    best: 95.73 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Sce.) · 2021-10-25
    58.52
    best: 68.8 (ViT)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Sen.) · 2021-10-25
    81.78
    best: 92.49 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Spa.) · 2021-10-25
    49.46
    best: 55.62 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Reasoning (Tim.) · 2021-10-25
    66.72
    best: 77.98 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Sub-tasks (Blank) · 2021-10-25
    77.08
    best: 83.62 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Sub-tasks (Img.) · 2021-10-25
    76.66
    best: 82.66 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on IconQA
    Sub-tasks (Txt.) · 2021-10-25
    70.47
    best: 75.19 (Patch-TRM)
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning · arXiv:2110.13214
  • Visual Question Answering (VQA) on A-OKVQA
    MC Accuracy · 2019-08-06
    41.5
    best: 83.75 (SMoLA-PaLI-X Specialist Model)
    ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks · arXiv:1908.02265
  • Visual Question Answering (VQA) on VQA v2 test-dev
    Accuracy · 2019-08-06
    70.55
    best: 84.3 (PaLI)
    ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks · arXiv:1908.02265

Methodology · 25 results

  • Large Language Model on MARS (Multimodal Analogical Reasoning dataSet)
    MRR · 2022-10-01
    0.287
    best: 0.341 (MarT_MKGformer)
    Multimodal Analogical Reasoning over Knowledge Graphs · arXiv:2210.00312
  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    Accuracy (%) · 2021-12-14
    55.9
    best: 71.5 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    86.9
    best: 88.8 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting adversarial
    Accuracy (%) · 2021-12-14
    51.8
    best: 66.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    73.7
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting balanced
    Accuracy (%) · 2021-12-14
    50.7
    best: 64.9 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    58.6
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE actant swap
    Accuracy (%) · 2021-12-14
    50.4
    best: 52.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE actant swap
    pairwise accuracy · 2021-12-14
    68.3
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference clean
    Accuracy (%) · 2021-12-14
    50
    best: 54.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    48.1
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting small numbers
    Accuracy (%) · 2021-12-14
    50.6
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    62.9
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE existence
    Accuracy (%) · 2021-12-14
    2.4
    best: 89 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE existence
    pairwise accuracy · 2021-12-14
    66.5
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference standard
    Accuracy (%) · 2021-12-14
    50
    best: 54.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    47.2
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE spatial relations
    Accuracy (%) · 2021-12-14
    49.9
    best: 53.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    57.2
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE plurality
    Accuracy (%) · 2021-12-14
    50.3
    best: 62 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE plurality
    pairwise accuracy · 2021-12-14
    61.2
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE action replacement
    Accuracy (%) · 2021-12-14
    52.6
    best: 57.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE action replacement
    pairwise accuracy · 2021-12-14
    70.7
    best: 75.6 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE
    Average Accuracy · 2021-12-14
    51.3
    best: 63.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE
    average pairwise accuracy · 2021-12-14
    63.7
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566

Reasoning · 2 results

  • Visual Reasoning on GD-VCR
    Gap (West) · 2021-09-14
    -7.28
    SOTA
    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning · arXiv:2109.06860
  • Visual Reasoning on GD-VCR
    Accuracy · 2021-09-14
    59.99
    best: 88.84 (Human)
    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning · arXiv:2109.06860

Knowledge Base · 2 results

  • Knowledge Graphs on MARS (Multimodal Analogical Reasoning dataSet)
    MRR · 2022-10-01
    0.287
    best: 0.341 (MarT_MKGformer)
    Multimodal Analogical Reasoning over Knowledge Graphs · arXiv:2210.00312
  • Knowledge Graph Completion on MARS (Multimodal Analogical Reasoning dataSet)
    MRR · 2022-10-01
    0.287
    best: 0.341 (MarT_MKGformer)
    Multimodal Analogical Reasoning over Knowledge Graphs · arXiv:2210.00312