Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, Albert Gatt
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
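VALSE's core evaluation asks a model to distinguish a correct image caption from a minimally edited "foil" of that caption; the pairwise accuracy reported below counts an example as solved when the model scores the true caption above its foil for the same image. A minimal sketch of that metric (the function name and toy scores are illustrative, not from the paper):

```python
def pairwise_accuracy(caption_scores, foil_scores):
    """Percentage of caption/foil pairs in which the true caption
    receives a strictly higher score than its foil."""
    assert len(caption_scores) == len(foil_scores)
    solved = sum(c > f for c, f in zip(caption_scores, foil_scores))
    return 100.0 * solved / len(caption_scores)

# Toy example: 3 of 4 pairs rank the caption above the foil.
print(pairwise_accuracy([0.9, 0.2, 0.8, 0.7], [0.1, 0.5, 0.3, 0.6]))  # 75.0
```

A chance-level model that ranks captions and foils randomly lands at 50, which is why many of the scores below hover near that mark.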
Papers with Code lists identical result rows under two task categories, "Multimodal Deep Learning" and "Multimodal Text and Image Classification"; they are merged below. For CLIP, GPT1, and GPT2, only pairwise accuracy is reported.

| Dataset | Model | Accuracy (%) | Pairwise Accuracy (%) |
|---|---|---|---|
| VALSE foil-it (noun phrases) | CLIP | — | 88.8 |
| VALSE foil-it (noun phrases) | LXMERT | 70.8 | 87.1 |
| VALSE foil-it (noun phrases) | ViLBERT 12-in-1 | 71.5 | 86.9 |
| VALSE foil-it (noun phrases) | ViLBERT | 55.9 | 86.9 |
| VALSE foil-it (noun phrases) | GPT2 | — | 80.7 |
| VALSE foil-it (noun phrases) | GPT1 | — | 77.5 |
| VALSE foil-it (noun phrases) | VisualBERT | 46.6 | 48.5 |
| VALSE counting adversarial | ViLBERT 12-in-1 | 66.7 | 77.3 |
| VALSE counting adversarial | ViLBERT | 51.8 | 73.7 |
| VALSE counting adversarial | GPT1 | — | 69.5 |
| VALSE counting adversarial | CLIP | — | 57.5 |
| VALSE counting adversarial | VisualBERT | 50 | 50 |
| VALSE counting adversarial | GPT2 | — | 45.3 |
| VALSE counting adversarial | LXMERT | 49.9 | 42.6 |
| VALSE counting balanced | ViLBERT 12-in-1 | 64.9 | 76.7 |
| VALSE counting balanced | LXMERT | 52 | 62.2 |
| VALSE counting balanced | CLIP | — | 62.1 |
| VALSE counting balanced | ViLBERT | 50.7 | 58.6 |
| VALSE counting balanced | GPT2 | — | 51.6 |
| VALSE counting balanced | GPT1 | — | 51.2 |
| VALSE counting balanced | VisualBERT | 48.3 | 48.2 |
| VALSE actant swap | GPT2 | — | 76.9 |
| VALSE actant swap | GPT1 | — | 72.2 |
| VALSE actant swap | CLIP | — | 68.6 |
| VALSE actant swap | ViLBERT | 50.4 | 68.3 |
| VALSE actant swap | ViLBERT 12-in-1 | 52.2 | 58.9 |
| VALSE actant swap | LXMERT | 48.5 | 45.8 |
| VALSE actant swap | VisualBERT | 49.7 | 44.4 |
| VALSE coreference clean | ViLBERT 12-in-1 | 54.3 | 69.2 |
| VALSE coreference clean | GPT2 | — | 50 |
| VALSE coreference clean | CLIP | — | 49.7 |
| VALSE coreference clean | ViLBERT | 50 | 48.1 |
| VALSE coreference clean | VisualBERT | 50 | 47.6 |
| VALSE coreference clean | GPT1 | — | 45.2 |
| VALSE coreference clean | LXMERT | 49 | 44.2 |
| VALSE counting small numbers | ViLBERT 12-in-1 | 69.2 | 80.2 |
| VALSE counting small numbers | LXMERT | 55.4 | 69.2 |
| VALSE counting small numbers | ViLBERT | 50.6 | 62.9 |
| VALSE counting small numbers | CLIP | — | 62.5 |
| VALSE counting small numbers | GPT2 | — | 49.8 |
| VALSE counting small numbers | GPT1 | — | 48.7 |
| VALSE counting small numbers | VisualBERT | 47.8 | 48.2 |
| VALSE existence | ViLBERT 12-in-1 | 89 | 95.6 |
| VALSE existence | LXMERT | 55.8 | 78.6 |
| VALSE existence | CLIP | — | 66.9 |
| VALSE existence | ViLBERT | 2.4 | 66.5 |
| VALSE existence | GPT1 | — | 61.8 |
| VALSE existence | GPT2 | — | 58 |
| VALSE existence | VisualBERT | 49.3 | 39.7 |
| VALSE coreference standard | ViLBERT 12-in-1 | 54.4 | 75.7 |
| VALSE coreference standard | GPT2 | — | 54.5 |
| VALSE coreference standard | CLIP | — | 52.1 |
| VALSE coreference standard | VisualBERT | 50 | 49.5 |
| VALSE coreference standard | ViLBERT | 50 | 47.2 |
| VALSE coreference standard | LXMERT | 49.8 | 46.8 |
| VALSE coreference standard | GPT1 | — | 45.6 |
| VALSE spatial relations | GPT1 | — | 77.2 |
| VALSE spatial relations | GPT2 | — | 75 |
| VALSE spatial relations | ViLBERT 12-in-1 | 53.4 | 67.7 |
| VALSE spatial relations | CLIP | — | 64.3 |
| VALSE spatial relations | LXMERT | 50.8 | 60.2 |
| VALSE spatial relations | ViLBERT | 49.9 | 57.2 |
| VALSE spatial relations | VisualBERT | 49.3 | 39.7 |
| VALSE plurality | ViLBERT 12-in-1 | 62 | 72.4 |
| VALSE plurality | LXMERT | 55.1 | 64.4 |
| VALSE plurality | ViLBERT | 50.3 | 61.2 |
| VALSE plurality | CLIP | — | 56.2 |
| VALSE plurality | GPT1 | — | 53.1 |
| VALSE plurality | GPT2 | — | 51.9 |
| VALSE plurality | VisualBERT | 46.5 | 45.7 |
| VALSE action replacement | CLIP | — | 75.6 |
| VALSE action replacement | ViLBERT | 52.6 | 70.7 |
| VALSE action replacement | GPT2 | — | 66.8 |
| VALSE action replacement | ViLBERT 12-in-1 | 57.3 | 65.9 |
| VALSE action replacement | GPT1 | — | 65.4 |
| VALSE action replacement | LXMERT | 51.1 | 54.8 |
| VALSE action replacement | VisualBERT | 48.8 | 49.2 |
| VALSE (average) | ViLBERT 12-in-1 | 63.2 | 75.1 |
| VALSE (average) | CLIP | — | 64 |
| VALSE (average) | ViLBERT | 51.3 | 63.7 |
| VALSE (average) | GPT1 | — | 60.7 |
| VALSE (average) | GPT2 | — | 60.1 |
| VALSE (average) | LXMERT | 53.5 | 59.6 |
| VALSE (average) | VisualBERT | 48.8 | 46.4 |
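The overall "average pairwise accuracy" figures appear to be unweighted macro-averages over the eleven per-phenomenon rows; a quick sanity check against the table, using the ViLBERT 12-in-1 values copied from above:

```python
# Per-phenomenon pairwise accuracy for ViLBERT 12-in-1, copied from the table.
vilbert_12in1 = {
    "foil-it (noun phrases)": 86.9,
    "counting adversarial": 77.3,
    "counting balanced": 76.7,
    "actant swap": 58.9,
    "coreference clean": 69.2,
    "counting small numbers": 80.2,
    "existence": 95.6,
    "coreference standard": 75.7,
    "spatial relations": 67.7,
    "plurality": 72.4,
    "action replacement": 65.9,
}

# Unweighted mean over the 11 phenomena/variants, rounded to one decimal.
macro_avg = round(sum(vilbert_12in1.values()) / len(vilbert_12in1), 1)
print(macro_avg)  # 75.1, matching the reported average pairwise accuracy
```

The same computation over the CLIP rows reproduces its reported average of 64 as well, supporting the macro-average reading.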