Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, Albert Gatt
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
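VALSE's core evaluation asks a model to distinguish a correct image caption from a minimally edited "foil" of that caption; the pairwise accuracy reported below counts an example as solved when the model scores the true caption above its foil for the same image. A minimal sketch of that metric (the function name and toy scores are illustrative, not from the paper):

```python
def pairwise_accuracy(caption_scores, foil_scores):
    """Percentage of caption/foil pairs in which the true caption
    receives a strictly higher score than its foil."""
    assert len(caption_scores) == len(foil_scores)
    solved = sum(c > f for c, f in zip(caption_scores, foil_scores))
    return 100.0 * solved / len(caption_scores)

# Toy example: 3 of 4 pairs rank the caption above the foil.
print(pairwise_accuracy([0.9, 0.2, 0.8, 0.7], [0.1, 0.5, 0.3, 0.6]))  # 75.0
```

A chance-level model that ranks captions and foils randomly lands at 50, which is why many of the scores below hover near that mark.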
Papers with Code lists identical result rows under two task categories, "Multimodal Deep Learning" and "Multimodal Text and Image Classification"; they are merged below. For CLIP, GPT1, and GPT2, only pairwise accuracy is reported.

| Dataset | Model | Accuracy (%) | Pairwise Accuracy (%) |
|---|---|---|---|
| VALSE foil-it (noun phrases) | CLIP | — | 88.8 |
| VALSE foil-it (noun phrases) | LXMERT | 70.8 | 87.1 |
| VALSE foil-it (noun phrases) | ViLBERT 12-in-1 | 71.5 | 86.9 |
| VALSE foil-it (noun phrases) | ViLBERT | 55.9 | 86.9 |
| VALSE foil-it (noun phrases) | GPT2 | — | 80.7 |
| VALSE foil-it (noun phrases) | GPT1 | — | 77.5 |
| VALSE foil-it (noun phrases) | VisualBERT | 46.6 | 48.5 |
| VALSE counting adversarial | ViLBERT 12-in-1 | 66.7 | 77.3 |
| VALSE counting adversarial | ViLBERT | 51.8 | 73.7 |
| VALSE counting adversarial | GPT1 | — | 69.5 |
| VALSE counting adversarial | CLIP | — | 57.5 |
| VALSE counting adversarial | VisualBERT | 50 | 50 |
| VALSE counting adversarial | GPT2 | — | 45.3 |
| VALSE counting adversarial | LXMERT | 49.9 | 42.6 |
| VALSE counting balanced | ViLBERT 12-in-1 | 64.9 | 76.7 |
| VALSE counting balanced | LXMERT | 52 | 62.2 |
| VALSE counting balanced | CLIP | — | 62.1 |
| VALSE counting balanced | ViLBERT | 50.7 | 58.6 |
| VALSE counting balanced | GPT2 | — | 51.6 |
| VALSE counting balanced | GPT1 | — | 51.2 |
| VALSE counting balanced | VisualBERT | 48.3 | 48.2 |
| VALSE actant swap | GPT2 | — | 76.9 |
| VALSE actant swap | GPT1 | — | 72.2 |
| VALSE actant swap | CLIP | — | 68.6 |
| VALSE actant swap | ViLBERT | 50.4 | 68.3 |
| VALSE actant swap | ViLBERT 12-in-1 | 52.2 | 58.9 |
| VALSE actant swap | LXMERT | 48.5 | 45.8 |
| VALSE actant swap | VisualBERT | 49.7 | 44.4 |
| VALSE coreference clean | ViLBERT 12-in-1 | 54.3 | 69.2 |
| VALSE coreference clean | GPT2 | — | 50 |
| VALSE coreference clean | CLIP | — | 49.7 |
| VALSE coreference clean | ViLBERT | 50 | 48.1 |
| VALSE coreference clean | VisualBERT | 50 | 47.6 |
| VALSE coreference clean | GPT1 | — | 45.2 |
| VALSE coreference clean | LXMERT | 49 | 44.2 |
| VALSE counting small numbers | ViLBERT 12-in-1 | 69.2 | 80.2 |
| VALSE counting small numbers | LXMERT | 55.4 | 69.2 |
| VALSE counting small numbers | ViLBERT | 50.6 | 62.9 |
| VALSE counting small numbers | CLIP | — | 62.5 |
| VALSE counting small numbers | GPT2 | — | 49.8 |
| VALSE counting small numbers | GPT1 | — | 48.7 |
| VALSE counting small numbers | VisualBERT | 47.8 | 48.2 |
| VALSE existence | ViLBERT 12-in-1 | 89 | 95.6 |
| VALSE existence | LXMERT | 55.8 | 78.6 |
| VALSE existence | CLIP | — | 66.9 |
| VALSE existence | ViLBERT | 2.4 | 66.5 |
| VALSE existence | GPT1 | — | 61.8 |
| VALSE existence | GPT2 | — | 58 |
| VALSE existence | VisualBERT | 49.3 | 39.7 |
| VALSE coreference standard | ViLBERT 12-in-1 | 54.4 | 75.7 |
| VALSE coreference standard | GPT2 | — | 54.5 |
| VALSE coreference standard | CLIP | — | 52.1 |
| VALSE coreference standard | VisualBERT | 50 | 49.5 |
| VALSE coreference standard | ViLBERT | 50 | 47.2 |
| VALSE coreference standard | LXMERT | 49.8 | 46.8 |
| VALSE coreference standard | GPT1 | — | 45.6 |
| VALSE spatial relations | GPT1 | — | 77.2 |
| VALSE spatial relations | GPT2 | — | 75 |
| VALSE spatial relations | ViLBERT 12-in-1 | 53.4 | 67.7 |
| VALSE spatial relations | CLIP | — | 64.3 |
| VALSE spatial relations | LXMERT | 50.8 | 60.2 |
| VALSE spatial relations | ViLBERT | 49.9 | 57.2 |
| VALSE spatial relations | VisualBERT | 49.3 | 39.7 |
| VALSE plurality | ViLBERT 12-in-1 | 62 | 72.4 |
| VALSE plurality | LXMERT | 55.1 | 64.4 |
| VALSE plurality | ViLBERT | 50.3 | 61.2 |
| VALSE plurality | CLIP | — | 56.2 |
| VALSE plurality | GPT1 | — | 53.1 |
| VALSE plurality | GPT2 | — | 51.9 |
| VALSE plurality | VisualBERT | 46.5 | 45.7 |
| VALSE action replacement | CLIP | — | 75.6 |
| VALSE action replacement | ViLBERT | 52.6 | 70.7 |
| VALSE action replacement | GPT2 | — | 66.8 |
| VALSE action replacement | ViLBERT 12-in-1 | 57.3 | 65.9 |
| VALSE action replacement | GPT1 | — | 65.4 |
| VALSE action replacement | LXMERT | 51.1 | 54.8 |
| VALSE action replacement | VisualBERT | 48.8 | 49.2 |
| VALSE (average) | ViLBERT 12-in-1 | 63.2 | 75.1 |
| VALSE (average) | CLIP | — | 64 |
| VALSE (average) | ViLBERT | 51.3 | 63.7 |
| VALSE (average) | GPT1 | — | 60.7 |
| VALSE (average) | GPT2 | — | 60.1 |
| VALSE (average) | LXMERT | 53.5 | 59.6 |
| VALSE (average) | VisualBERT | 48.8 | 46.4 |
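The overall "average pairwise accuracy" figures appear to be unweighted macro-averages over the eleven per-phenomenon rows; a quick sanity check against the table, using the ViLBERT 12-in-1 values copied from above:

```python
# Per-phenomenon pairwise accuracy for ViLBERT 12-in-1, copied from the table.
vilbert_12in1 = {
    "foil-it (noun phrases)": 86.9,
    "counting adversarial": 77.3,
    "counting balanced": 76.7,
    "actant swap": 58.9,
    "coreference clean": 69.2,
    "counting small numbers": 80.2,
    "existence": 95.6,
    "coreference standard": 75.7,
    "spatial relations": 67.7,
    "plurality": 72.4,
    "action replacement": 65.9,
}

# Unweighted mean over the 11 phenomena/variants, rounded to one decimal.
macro_avg = round(sum(vilbert_12in1.values()) / len(vilbert_12in1), 1)
print(macro_avg)  # 75.1, matching the reported average pairwise accuracy
```

The same computation over the CLIP rows reproduces its reported average of 64 as well, supporting the macro-average reading.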