Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Measuring Progress in Fine-grained Vision-and-Language Understanding

Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh

2023-05-12 · Visual Reasoning
Paper · PDF · Code (official)

Abstract

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Group Score | 21.2 | X-VLM 16M |
| Visual Reasoning | Winoground | Image Score | 24.5 | X-VLM 16M |
| Visual Reasoning | Winoground | Text Score | 46.7 | X-VLM 16M |
| Visual Reasoning | Winoground | Group Score | 21.5 | X-VLM 4M |
| Visual Reasoning | Winoground | Image Score | 26.7 | X-VLM 4M |
| Visual Reasoning | Winoground | Text Score | 44 | X-VLM 4M |
| Visual Reasoning | Winoground | Group Score | 14.5 | BLIP 14M |
| Visual Reasoning | Winoground | Image Score | 18.5 | BLIP 14M |
| Visual Reasoning | Winoground | Text Score | 36.5 | BLIP 14M |
| Visual Reasoning | Winoground | Group Score | 11.7 | BLIP 129M |
| Visual Reasoning | Winoground | Image Score | 15 | BLIP 129M |
| Visual Reasoning | Winoground | Text Score | 35.5 | BLIP 129M |
| Visual Reasoning | Winoground | Group Score | 12.2 | BLIP 129M (CapFilt/L) |
| Visual Reasoning | Winoground | Image Score | 15.2 | BLIP 129M (CapFilt/L) |
| Visual Reasoning | Winoground | Text Score | 34.7 | BLIP 129M (CapFilt/L) |
| Visual Reasoning | Winoground | Group Score | 12.2 | BLIP-ViT/L 129M |
| Visual Reasoning | Winoground | Image Score | 14.5 | BLIP-ViT/L 129M |
| Visual Reasoning | Winoground | Text Score | 34.7 | BLIP-ViT/L 129M |
| Visual Reasoning | Winoground | Group Score | 12.2 | PEVL 14M |
| Visual Reasoning | Winoground | Image Score | 15.7 | PEVL 14M |
| Visual Reasoning | Winoground | Text Score | 33.2 | PEVL 14M |
| Visual Reasoning | Winoground | Group Score | 12.7 | ALBEF 14M |
| Visual Reasoning | Winoground | Image Score | 16.2 | ALBEF 14M |
| Visual Reasoning | Winoground | Text Score | 32.5 | ALBEF 14M |
| Visual Reasoning | Winoground | Group Score | 11 | ALBEF 4M |
| Visual Reasoning | Winoground | Image Score | 15.5 | ALBEF 4M |
| Visual Reasoning | Winoground | Text Score | 29.2 | ALBEF 4M |
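For readers unfamiliar with the three Winoground metrics above, the sketch below shows how they are computed, following the definitions in the Winoground paper (Thrush et al., 2022). Each example pairs two captions with two images whose words are identical but reordered; `score` stands in for any model's image-text matching score and is a placeholder, not part of this paper's code.

```python
def winoground_metrics(examples, score):
    """Return (text, image, group) accuracy over Winoground-style examples.

    `examples` is an iterable of (c0, c1, i0, i1) tuples: two captions and
    two images per example. `score(caption, image)` is a hypothetical
    image-text matching function supplied by the caller.
    """
    text_ok = image_ok = group_ok = 0
    n = 0
    for c0, c1, i0, i1 in examples:
        # Text score: for each image, the matching caption must outscore
        # the mismatched one.
        t = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
        # Image score: for each caption, the matching image must outscore
        # the mismatched one.
        i = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
        text_ok += t
        image_ok += i
        group_ok += t and i  # Group score: both conditions must hold.
        n += 1
    return text_ok / n, image_ok / n, group_ok / n
```

Because the group score requires both conditions at once, it can never exceed the text or image score, which is why the group-score rows in the table are consistently the lowest for every model.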

Related Papers

- LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
- Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
- PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
- Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
- MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
- Skywork-R1V3 Technical Report (2025-07-08)
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
- Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (2025-07-07)