Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh
While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has led to increased interest in the community in developing either new benchmarks or new models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which sometimes even degrades performance. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics and discover that, for some tasks, performance peaks early in training or fluctuates significantly without ever converging.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Group Score | 21.2 | X-VLM 16M |
| Visual Reasoning | Winoground | Image Score | 24.5 | X-VLM 16M |
| Visual Reasoning | Winoground | Text Score | 46.7 | X-VLM 16M |
| Visual Reasoning | Winoground | Group Score | 21.5 | X-VLM 4M |
| Visual Reasoning | Winoground | Image Score | 26.7 | X-VLM 4M |
| Visual Reasoning | Winoground | Text Score | 44.0 | X-VLM 4M |
| Visual Reasoning | Winoground | Group Score | 14.5 | BLIP 14M |
| Visual Reasoning | Winoground | Image Score | 18.5 | BLIP 14M |
| Visual Reasoning | Winoground | Text Score | 36.5 | BLIP 14M |
| Visual Reasoning | Winoground | Group Score | 11.7 | BLIP 129M |
| Visual Reasoning | Winoground | Image Score | 15.0 | BLIP 129M |
| Visual Reasoning | Winoground | Text Score | 35.5 | BLIP 129M |
| Visual Reasoning | Winoground | Group Score | 12.2 | BLIP 129M (CapFilt/L) |
| Visual Reasoning | Winoground | Image Score | 15.2 | BLIP 129M (CapFilt/L) |
| Visual Reasoning | Winoground | Text Score | 34.7 | BLIP 129M (CapFilt/L) |
| Visual Reasoning | Winoground | Group Score | 12.2 | BLIP-ViT/L 129M |
| Visual Reasoning | Winoground | Image Score | 14.5 | BLIP-ViT/L 129M |
| Visual Reasoning | Winoground | Text Score | 34.7 | BLIP-ViT/L 129M |
| Visual Reasoning | Winoground | Group Score | 12.2 | PEVL 14M |
| Visual Reasoning | Winoground | Image Score | 15.7 | PEVL 14M |
| Visual Reasoning | Winoground | Text Score | 33.2 | PEVL 14M |
| Visual Reasoning | Winoground | Group Score | 12.7 | ALBEF 14M |
| Visual Reasoning | Winoground | Image Score | 16.2 | ALBEF 14M |
| Visual Reasoning | Winoground | Text Score | 32.5 | ALBEF 14M |
| Visual Reasoning | Winoground | Group Score | 11.0 | ALBEF 4M |
| Visual Reasoning | Winoground | Image Score | 15.5 | ALBEF 4M |
| Visual Reasoning | Winoground | Text Score | 29.2 | ALBEF 4M |
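
The three Winoground metrics in the table can be illustrated with a small sketch. It follows the standard definitions from the Winoground benchmark (each example has two captions and two images, and a model assigns a similarity score to every caption-image pair); it is not the evaluation code used to produce the numbers above. The names `PairScores` and `winoground_scores` are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class PairScores:
    """Model similarity scores for one example (hypothetical container).

    c0/c1 are the two captions and i0/i1 the two images;
    (c0, i0) and (c1, i1) are the matching pairs.
    """
    c0_i0: float
    c0_i1: float
    c1_i0: float
    c1_i1: float


def text_correct(s: PairScores) -> bool:
    # For each image, the matching caption must score higher than the other caption.
    return s.c0_i0 > s.c1_i0 and s.c1_i1 > s.c0_i1


def image_correct(s: PairScores) -> bool:
    # For each caption, the matching image must score higher than the other image.
    return s.c0_i0 > s.c0_i1 and s.c1_i1 > s.c1_i0


def group_correct(s: PairScores) -> bool:
    # An example counts for the group score only if both criteria hold.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: Iterable[PairScores]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages of examples."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    n = len(examples)
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }
```

Because the group criterion requires both the text and image criteria to hold on the same example, the group score can never exceed either of the other two, which matches the ordering of the values in every row group of the table.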