Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

LXMERT

Reported on 57 benchmarks across 4 tasks · 4 papers · 2 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
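As a rough illustration of what exact-name matching implies, the sketch below groups result rows by the model name string exactly as reported. The row layout and the helper `group_by_exact_name` are assumptions made for this example, not the site's actual implementation, and the lowercase `lxmert` row is hypothetical.

```python
from collections import defaultdict


def group_by_exact_name(rows):
    """Group result rows by the model name exactly as reported (case-sensitive)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["model"]].append(row)
    return dict(groups)


rows = [
    # Two rows from different papers that use the identical name "LXMERT".
    {"model": "LXMERT", "benchmark": "VQA v2 test-std", "paper": "arXiv:1908.07490"},
    {"model": "LXMERT", "benchmark": "VSR", "paper": "arXiv:2205.00363"},
    # Hypothetical row: a differently spelled name is NOT merged with "LXMERT".
    {"model": "lxmert", "benchmark": "hypothetical benchmark", "paper": "hypothetical"},
]

grouped = group_by_exact_name(rows)
for name, items in grouped.items():
    print(name, [item["benchmark"] for item in items])
```

Under this scheme, rows from different papers land on the same page whenever the name string is identical, even if the underlying model variants differ, which is the caveat the note describes.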

Natural Language Processing · 28 results

  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    Accuracy (%) · 2021-12-14
    70.8
    best: 71.5 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    87.1
    best: 88.8 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting adversarial
    Accuracy (%) · 2021-12-14
    49.9
    best: 66.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    42.6
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting balanced
    Accuracy (%) · 2021-12-14
    52
    best: 64.9 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    62.2
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE actant swap
    Accuracy (%) · 2021-12-14
    48.5
    best: 52.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE actant swap
    pairwise accuracy · 2021-12-14
    45.8
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference clean
    Accuracy (%) · 2021-12-14
    49
    best: 54.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    44.2
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting small numbers
    Accuracy (%) · 2021-12-14
    55.4
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    69.2
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE existence
    Accuracy (%) · 2021-12-14
    55.8
    best: 89 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE existence
    pairwise accuracy · 2021-12-14
    78.6
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference standard
    Accuracy (%) · 2021-12-14
    49.8
    best: 54.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    46.8
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE spatial relations
    Accuracy (%) · 2021-12-14
    50.8
    best: 53.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    60.2
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE plurality
    Accuracy (%) · 2021-12-14
    55.1
    best: 62 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE plurality
    pairwise accuracy · 2021-12-14
    64.4
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE action replacement
    Accuracy (%) · 2021-12-14
    51.1
    best: 57.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE action replacement
    pairwise accuracy · 2021-12-14
    54.8
    best: 75.6 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE
    Average Accuracy · 2021-12-14
    53.5
    best: 63.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Text and Image Classification on VALSE
    average pairwise accuracy · 2021-12-14
    59.6
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Visual Question Answering (VQA) on A-OKVQA
    DA VQA Score · 2019-08-20
    25.9
    best: 70.55 (SMoLA-PaLI-X Specialist Model)
    LXMERT: Learning Cross-Modality Encoder Representations from Transformers · arXiv:1908.07490
  • Visual Question Answering (VQA) on A-OKVQA
    MC Accuracy · 2019-08-20
    41.6
    best: 83.75 (SMoLA-PaLI-X Specialist Model)
    LXMERT: Learning Cross-Modality Encoder Representations from Transformers · arXiv:1908.07490
  • Visual Question Answering (VQA) on GQA test-std
    Accuracy · uses extra data · 2019-08-20
    60.3
    best: 65.14 (ProTo)
    LXMERT: Learning Cross-Modality Encoder Representations from Transformers · arXiv:1908.07490
  • Visual Question Answering (VQA) on VQA v2 test-std
    overall · 2019-08-20
    72.5
    best: 84.03 (BEiT-3)
    LXMERT: Learning Cross-Modality Encoder Representations from Transformers · arXiv:1908.07490

Methodology · 24 results

  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    Accuracy (%) · 2021-12-14
    70.8
    best: 71.5 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE foil-it (noun phrases)
    pairwise accuracy · 2021-12-14
    87.1
    best: 88.8 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting adversarial
    Accuracy (%) · 2021-12-14
    49.9
    best: 66.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting adversarial
    pairwise accuracy · 2021-12-14
    42.6
    best: 77.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting balanced
    Accuracy (%) · 2021-12-14
    52
    best: 64.9 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting balanced
    pairwise accuracy · 2021-12-14
    62.2
    best: 76.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE actant swap
    Accuracy (%) · 2021-12-14
    48.5
    best: 52.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE actant swap
    pairwise accuracy · 2021-12-14
    45.8
    best: 76.9 (GPT2)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference clean
    Accuracy (%) · 2021-12-14
    49
    best: 54.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference clean
    pairwise accuracy · 2021-12-14
    44.2
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting small numbers
    Accuracy (%) · 2021-12-14
    55.4
    best: 69.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE counting small numbers
    pairwise accuracy · 2021-12-14
    69.2
    best: 80.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE existence
    Accuracy (%) · 2021-12-14
    55.8
    best: 89 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE existence
    pairwise accuracy · 2021-12-14
    78.6
    best: 95.6 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference standard
    Accuracy (%) · 2021-12-14
    49.8
    best: 54.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE coreference standard
    pairwise accuracy · 2021-12-14
    46.8
    best: 75.7 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE spatial relations
    Accuracy (%) · 2021-12-14
    50.8
    best: 53.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE spatial relations
    pairwise accuracy · 2021-12-14
    60.2
    best: 77.2 (GPT1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE plurality
    Accuracy (%) · 2021-12-14
    55.1
    best: 62 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE plurality
    pairwise accuracy · 2021-12-14
    64.4
    best: 72.4 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE action replacement
    Accuracy (%) · 2021-12-14
    51.1
    best: 57.3 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE action replacement
    pairwise accuracy · 2021-12-14
    54.8
    best: 75.6 (CLIP)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE
    Average Accuracy · 2021-12-14
    53.5
    best: 63.2 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566
  • Multimodal Deep Learning on VALSE
    average pairwise accuracy · 2021-12-14
    59.6
    best: 75.1 (ViLBERT 12-in-1)
    VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · arXiv:2112.07566

Reasoning · 5 results

  • Visual Reasoning on VSR
    accuracy · 2022-04-30
    70.1
    SOTA
    Visual Spatial Reasoning · arXiv:2205.00363
  • Visual Reasoning on NLVR2 Test
    Accuracy · 2019-08-20
    76.2
    best: 92.58 (BEiT-3)
    SOTA
    LXMERT: Learning Cross-Modality Encoder Representations from Transformers · arXiv:1908.07490
  • Visual Reasoning on Winoground
    Group Score · 2022-04-07
    4
    best: 58.75 (GPT-4V (CoT, pick b/w two options))
    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality · arXiv:2204.03162
  • Visual Reasoning on Winoground
    Image Score · 2022-04-07
    7
    best: 68.75 (GPT-4V (CoT, pick b/w two options))
    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality · arXiv:2204.03162
  • Visual Reasoning on Winoground
    Text Score · 2022-04-07
    19.25
    best: 75.5 (GPT-4o + CA)
    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality · arXiv:2204.03162