What matters when building vision-language models?

Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

2024-05-03

Tags: MMR total; Long-Context Understanding; 1 Image, 2*2 Stitching; Image Retrieval

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Results

Task	Dataset	Metric	Value	Model
Long-Context Understanding	MMNeedle	1 Image, 2*2 Stitching, Exact Accuracy	18.9	Idefics2-8B
Long-Context Understanding	MMNeedle	1 Image, 4*4 Stitching, Exact Accuracy	7.8	Idefics2-8B
Long-Context Understanding	MMNeedle	1 Image, 8*8 Stitching, Exact Accuracy	0.9	Idefics2-8B
MMR total	MRR-Benchmark	Total Column Score	256	Idefics2-8B
