VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, Albert Gatt

2021-12-14 | ACL 2022 5 | image-sentence alignment

Abstract

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
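The results for this paper report a "pairwise accuracy" for each test. A minimal sketch of how such a metric can be computed follows, assuming it is the share of caption/foil pairs in which the model scores the correct caption strictly higher than its foil. The function name, the strict-inequality tie handling, and the example scores are illustrative assumptions, not taken from the official code.

```python
from typing import Sequence


def pairwise_accuracy(caption_scores: Sequence[float],
                      foil_scores: Sequence[float]) -> float:
    """Return the percentage of pairs where the caption outscores its foil.

    Assumption: a tie counts as a miss, because the model failed to
    prefer the correct caption.
    """
    if len(caption_scores) != len(foil_scores):
        raise ValueError("each caption score needs exactly one foil score")
    if not caption_scores:
        raise ValueError("at least one caption/foil pair is required")
    wins = sum(c > f for c, f in zip(caption_scores, foil_scores))
    return 100.0 * wins / len(caption_scores)


# Hypothetical image-sentence alignment scores for four caption/foil pairs.
caption_scores = [0.91, 0.40, 0.75, 0.62]
foil_scores = [0.30, 0.55, 0.75, 0.10]

result = pairwise_accuracy(caption_scores, foil_scores)
print(f"pairwise accuracy: {result:.1f}")
```

Under this definition a scorer that picks at random lands near 50, which is a useful reference point when reading the per-test values.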

Results

Task: Multimodal Deep Learning ("-" marks a metric with no listed value)

| Dataset | Model | Accuracy (%) | pairwise accuracy |
| --- | --- | --- | --- |
| VALSE foil-it (noun phrases) | CLIP | - | 88.8 |
| VALSE foil-it (noun phrases) | LXMERT | 70.8 | 87.1 |
| VALSE foil-it (noun phrases) | ViLBERT 12-in-1 | 71.5 | 86.9 |
| VALSE foil-it (noun phrases) | ViLBERT | 55.9 | 86.9 |
| VALSE foil-it (noun phrases) | GPT2 | - | 80.7 |
| VALSE foil-it (noun phrases) | GPT1 | - | 77.5 |
| VALSE foil-it (noun phrases) | VisualBERT | 46.6 | 48.5 |
| VALSE counting adversarial | ViLBERT 12-in-1 | 66.7 | 77.3 |
| VALSE counting adversarial | ViLBERT | 51.8 | 73.7 |
| VALSE counting adversarial | GPT1 | - | 69.5 |
| VALSE counting adversarial | CLIP | - | 57.5 |
| VALSE counting adversarial | VisualBERT | 50 | 50 |
| VALSE counting adversarial | GPT2 | - | 45.3 |
| VALSE counting adversarial | LXMERT | 49.9 | 42.6 |
| VALSE counting balanced | ViLBERT 12-in-1 | 64.9 | 76.7 |
| VALSE counting balanced | LXMERT | 52 | 62.2 |
| VALSE counting balanced | CLIP | - | 62.1 |
| VALSE counting balanced | ViLBERT | 50.7 | 58.6 |
| VALSE counting balanced | GPT2 | - | 51.6 |
| VALSE counting balanced | GPT1 | - | 51.2 |
| VALSE counting balanced | VisualBERT | 48.3 | 48.2 |
| VALSE actant swap | GPT2 | - | 76.9 |
| VALSE actant swap | GPT1 | - | 72.2 |
| VALSE actant swap | CLIP | - | 68.6 |
| VALSE actant swap | ViLBERT | 50.4 | 68.3 |
| VALSE actant swap | ViLBERT 12-in-1 | 52.2 | 58.9 |
| VALSE actant swap | LXMERT | 48.5 | 45.8 |
| VALSE actant swap | VisualBERT | 49.7 | 44.4 |
| VALSE coreference clean | ViLBERT 12-in-1 | 54.3 | 69.2 |
| VALSE coreference clean | GPT2 | - | 50 |
| VALSE coreference clean | CLIP | - | 49.7 |
| VALSE coreference clean | ViLBERT | 50 | 48.1 |
| VALSE coreference clean | VisualBERT | 50 | 47.6 |
| VALSE coreference clean | GPT1 | - | 45.2 |
| VALSE coreference clean | LXMERT | 49 | 44.2 |
| VALSE counting small numbers | ViLBERT 12-in-1 | 69.2 | 80.2 |
| VALSE counting small numbers | LXMERT | 55.4 | 69.2 |
| VALSE counting small numbers | ViLBERT | 50.6 | 62.9 |
| VALSE counting small numbers | CLIP | - | 62.5 |
| VALSE counting small numbers | GPT2 | - | 49.8 |
| VALSE counting small numbers | GPT1 | - | 48.7 |
| VALSE counting small numbers | VisualBERT | 47.8 | 48.2 |
| VALSE existence | ViLBERT 12-in-1 | 89 | 95.6 |
| VALSE existence | LXMERT | 55.8 | 78.6 |
| VALSE existence | CLIP | - | 66.9 |
| VALSE existence | ViLBERT | 2.4 | 66.5 |
| VALSE existence | GPT1 | - | 61.8 |
| VALSE existence | GPT2 | - | 58 |
| VALSE existence | VisualBERT | 49.3 | 39.7 |
| VALSE coreference standard | ViLBERT 12-in-1 | 54.4 | 75.7 |
| VALSE coreference standard | GPT2 | - | 54.5 |
| VALSE coreference standard | CLIP | - | 52.1 |
| VALSE coreference standard | VisualBERT | 50 | 49.5 |
| VALSE coreference standard | ViLBERT | 50 | 47.2 |
| VALSE coreference standard | LXMERT | 49.8 | 46.8 |
| VALSE coreference standard | GPT1 | - | 45.6 |
| VALSE spatial relations | GPT1 | - | 77.2 |
| VALSE spatial relations | GPT2 | - | 75 |
| VALSE spatial relations | ViLBERT 12-in-1 | 53.4 | 67.7 |
| VALSE spatial relations | CLIP | - | 64.3 |
| VALSE spatial relations | LXMERT | 50.8 | 60.2 |
| VALSE spatial relations | ViLBERT | 49.9 | 57.2 |
| VALSE spatial relations | VisualBERT | 49.3 | 39.7 |
| VALSE plurality | ViLBERT 12-in-1 | 62 | 72.4 |
| VALSE plurality | LXMERT | 55.1 | 64.4 |
| VALSE plurality | ViLBERT | 50.3 | 61.2 |
| VALSE plurality | CLIP | - | 56.2 |
| VALSE plurality | GPT1 | - | 53.1 |
| VALSE plurality | GPT2 | - | 51.9 |
| VALSE plurality | VisualBERT | 46.5 | 45.7 |
| VALSE action replacement | CLIP | - | 75.6 |
| VALSE action replacement | ViLBERT | 52.6 | 70.7 |
| VALSE action replacement | GPT2 | - | 66.8 |
| VALSE action replacement | ViLBERT 12-in-1 | 57.3 | 65.9 |
| VALSE action replacement | GPT1 | - | 65.4 |
| VALSE action replacement | LXMERT | 51.1 | 54.8 |
| VALSE action replacement | VisualBERT | 48.8 | 49.2 |

| Dataset | Model | Average Accuracy | average pairwise accuracy |
| --- | --- | --- | --- |
| VALSE | ViLBERT 12-in-1 | 63.2 | 75.1 |
| VALSE | CLIP | - | 64 |
| VALSE | ViLBERT | 51.3 | 63.7 |
| VALSE | GPT1 | - | 60.7 |
| VALSE | GPT2 | - | 60.1 |
| VALSE | LXMERT | 53.5 | 59.6 |
| VALSE | VisualBERT | 48.8 | 46.4 |

The same results, with identical values for every dataset, metric, and model, are also listed under the task Multimodal Text and Image Classification.
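
The "average pairwise accuracy" values listed above appear consistent with an unweighted mean over the eleven per-test pairwise accuracies. This is an observation from the listed numbers, not a documented procedure. A small sketch of that check for two models follows; the dictionary and helper names are illustrative.

```python
from statistics import mean

# Per-test pairwise accuracies copied from the results listed above.
PAIRWISE = {
    "ViLBERT 12-in-1": {
        "foil-it (noun phrases)": 86.9, "counting adversarial": 77.3,
        "counting balanced": 76.7, "actant swap": 58.9,
        "coreference clean": 69.2, "counting small numbers": 80.2,
        "existence": 95.6, "coreference standard": 75.7,
        "spatial relations": 67.7, "plurality": 72.4,
        "action replacement": 65.9,
    },
    "CLIP": {
        "foil-it (noun phrases)": 88.8, "counting adversarial": 57.5,
        "counting balanced": 62.1, "actant swap": 68.6,
        "coreference clean": 49.7, "counting small numbers": 62.5,
        "existence": 66.9, "coreference standard": 52.1,
        "spatial relations": 64.3, "plurality": 56.2,
        "action replacement": 75.6,
    },
}


def unweighted_average(per_test: dict) -> float:
    """Assumed aggregation: plain mean over tests, rounded to one decimal."""
    return round(mean(per_test.values()), 1)


averages = {model: unweighted_average(tests) for model, tests in PAIRWISE.items()}
for model, value in averages.items():
    print(f"{model}: {value}")
```

Both computed values agree with the listed averages (75.1 for ViLBERT 12-in-1 and 64 for CLIP). An unweighted mean gives every test equal influence regardless of how many examples it contains.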

Related Papers

Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning (2021-04-28)
From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge (2015-11-10)