Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

ViLBERT 12-in-1

Reported on 48 benchmarks across 2 tasks · 1 paper · 40 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Methodology · 24 results

All results in this section come from "VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena" (arXiv:2112.07566), dated 2021-12-14.

  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    Accuracy (%): 71.5 · SOTA
  • Multimodal Deep Learning on VALSE counting adversarial
    Accuracy (%): 66.7 · SOTA
  • Multimodal Deep Learning on VALSE counting adversarial
    Pairwise accuracy: 77.3 · SOTA
  • Multimodal Deep Learning on VALSE counting balanced
    Accuracy (%): 64.9 · SOTA
  • Multimodal Deep Learning on VALSE counting balanced
    Pairwise accuracy: 76.7 · SOTA
  • Multimodal Deep Learning on VALSE actant swap
    Accuracy (%): 52.2 · SOTA
  • Multimodal Deep Learning on VALSE coreference clean
    Accuracy (%): 54.3 · SOTA
  • Multimodal Deep Learning on VALSE coreference clean
    Pairwise accuracy: 69.2 · SOTA
  • Multimodal Deep Learning on VALSE counting small numbers
    Accuracy (%): 69.2 · SOTA
  • Multimodal Deep Learning on VALSE counting small numbers
    Pairwise accuracy: 80.2 · SOTA
  • Multimodal Deep Learning on VALSE existence
    Accuracy (%): 89.0 · SOTA
  • Multimodal Deep Learning on VALSE existence
    Pairwise accuracy: 95.6 · SOTA
  • Multimodal Deep Learning on VALSE coreference standard
    Accuracy (%): 54.4 · SOTA
  • Multimodal Deep Learning on VALSE coreference standard
    Pairwise accuracy: 75.7 · SOTA
  • Multimodal Deep Learning on VALSE spatial relations
    Accuracy (%): 53.4 · SOTA
  • Multimodal Deep Learning on VALSE plurality
    Accuracy (%): 62.0 · SOTA
  • Multimodal Deep Learning on VALSE plurality
    Pairwise accuracy: 72.4 · SOTA
  • Multimodal Deep Learning on VALSE action replacement
    Accuracy (%): 57.3 · SOTA
  • Multimodal Deep Learning on VALSE
    Average accuracy: 63.2 · SOTA
  • Multimodal Deep Learning on VALSE
    Average pairwise accuracy: 75.1 · SOTA
  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    Pairwise accuracy: 86.9 · best: 88.8 (CLIP)
  • Multimodal Deep Learning on VALSE actant swap
    Pairwise accuracy: 58.9 · best: 76.9 (GPT2)
  • Multimodal Deep Learning on VALSE spatial relations
    Pairwise accuracy: 67.7 · best: 77.2 (GPT1)
  • Multimodal Deep Learning on VALSE action replacement
    Pairwise accuracy: 65.9 · best: 75.6 (CLIP)
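Each VALSE piece above is reported with two metrics. As we read the VALSE paper, "Accuracy (%)" treats every caption and every foil as a separate binary decision (does this text match the image?), while "pairwise accuracy" only asks whether the model scores the true caption above its foil for the same image. That second framing needs no image-level threshold, which is presumably why text-only models such as GPT1 and GPT2 can hold the best pairwise scores on some pieces. The sketch below illustrates this reading only; the function names, the 0.5 threshold, and the toy scores are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def pairwise_accuracy(caption_scores: Sequence[float], foil_scores: Sequence[float]) -> float:
    """Percent of aligned (caption, foil) pairs where the caption scores strictly higher."""
    if len(caption_scores) != len(foil_scores) or not caption_scores:
        raise ValueError("need equally long, non-empty score lists")
    wins = sum(c > f for c, f in zip(caption_scores, foil_scores))
    return 100.0 * wins / len(caption_scores)


def binary_accuracy(caption_scores: Sequence[float], foil_scores: Sequence[float],
                    threshold: float = 0.5) -> float:
    """Percent of texts classified correctly: captions should score >= threshold,
    foils should score below it. The 0.5 threshold is an illustrative assumption."""
    correct = sum(s >= threshold for s in caption_scores) + sum(s < threshold for s in foil_scores)
    total = len(caption_scores) + len(foil_scores)
    if total == 0:
        raise ValueError("need at least one score")
    return 100.0 * correct / total


# Hypothetical image-text match scores for four images (one caption and one foil each).
captions = [0.9, 0.6, 0.4, 0.8]
foils = [0.3, 0.7, 0.2, 0.6]
print(pairwise_accuracy(captions, foils))  # 75.0
print(binary_accuracy(captions, foils))    # 62.5
```

The toy example mirrors the pattern in the entries above, where every piece's pairwise accuracy exceeds its accuracy: a model can rank most captions above their foils while its absolute scores still fall on the wrong side of a fixed threshold.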

Natural Language Processing · 24 results

All results in this section come from "VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena" (arXiv:2112.07566), dated 2021-12-14.

  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    Accuracy (%): 71.5 · SOTA
  • Multimodal Text and Image Classification on VALSE counting adversarial
    Accuracy (%): 66.7 · SOTA
  • Multimodal Text and Image Classification on VALSE counting adversarial
    Pairwise accuracy: 77.3 · SOTA
  • Multimodal Text and Image Classification on VALSE counting balanced
    Accuracy (%): 64.9 · SOTA
  • Multimodal Text and Image Classification on VALSE counting balanced
    Pairwise accuracy: 76.7 · SOTA
  • Multimodal Text and Image Classification on VALSE actant swap
    Accuracy (%): 52.2 · SOTA
  • Multimodal Text and Image Classification on VALSE coreference clean
    Accuracy (%): 54.3 · SOTA
  • Multimodal Text and Image Classification on VALSE coreference clean
    Pairwise accuracy: 69.2 · SOTA
  • Multimodal Text and Image Classification on VALSE counting small numbers
    Accuracy (%): 69.2 · SOTA
  • Multimodal Text and Image Classification on VALSE counting small numbers
    Pairwise accuracy: 80.2 · SOTA
  • Multimodal Text and Image Classification on VALSE existence
    Accuracy (%): 89.0 · SOTA
  • Multimodal Text and Image Classification on VALSE existence
    Pairwise accuracy: 95.6 · SOTA
  • Multimodal Text and Image Classification on VALSE coreference standard
    Accuracy (%): 54.4 · SOTA
  • Multimodal Text and Image Classification on VALSE coreference standard
    Pairwise accuracy: 75.7 · SOTA
  • Multimodal Text and Image Classification on VALSE spatial relations
    Accuracy (%): 53.4 · SOTA
  • Multimodal Text and Image Classification on VALSE plurality
    Accuracy (%): 62.0 · SOTA
  • Multimodal Text and Image Classification on VALSE plurality
    Pairwise accuracy: 72.4 · SOTA
  • Multimodal Text and Image Classification on VALSE action replacement
    Accuracy (%): 57.3 · SOTA
  • Multimodal Text and Image Classification on VALSE
    Average accuracy: 63.2 · SOTA
  • Multimodal Text and Image Classification on VALSE
    Average pairwise accuracy: 75.1 · SOTA
  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    Pairwise accuracy: 86.9 · best: 88.8 (CLIP)
  • Multimodal Text and Image Classification on VALSE actant swap
    Pairwise accuracy: 58.9 · best: 76.9 (GPT2)
  • Multimodal Text and Image Classification on VALSE spatial relations
    Pairwise accuracy: 67.7 · best: 77.2 (GPT1)
  • Multimodal Text and Image Classification on VALSE action replacement
    Pairwise accuracy: 65.9 · best: 75.6 (CLIP)
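The two overall VALSE rows (average accuracy 63.2, average pairwise accuracy 75.1) appear to be unweighted means over the 11 per-piece results listed on this page: recomputing them from those values reproduces both numbers after rounding to one decimal. The snippet below is only a consistency check on the listed figures; the equal weighting of pieces is our assumption, not something the page states.

```python
# Per-piece scores for ViLBERT 12-in-1 as listed on this page: (accuracy %, pairwise accuracy).
scores = {
    "existence": (89.0, 95.6),
    "plurality": (62.0, 72.4),
    "counting balanced": (64.9, 76.7),
    "counting small numbers": (69.2, 80.2),
    "counting adversarial": (66.7, 77.3),
    "spatial relations": (53.4, 67.7),
    "action replacement": (57.3, 65.9),
    "actant swap": (52.2, 58.9),
    "coreference standard": (54.4, 75.7),
    "coreference clean": (54.3, 69.2),
    "foil-it (noun phrases)": (71.5, 86.9),
}


def unweighted_mean(values):
    """Plain arithmetic mean; every VALSE piece counts equally (assumed weighting)."""
    values = list(values)
    return sum(values) / len(values)


avg_acc = unweighted_mean(acc for acc, _ in scores.values())
avg_pairwise = unweighted_mean(pw for _, pw in scores.values())
print(f"Average accuracy: {avg_acc:.1f}")                # Average accuracy: 63.2
print(f"Average pairwise accuracy: {avg_pairwise:.1f}")  # Average pairwise accuracy: 75.1
```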