Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

What matters when building vision-language models?

Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

2024-05-03 | MMR total | Long-Context Understanding | 1 Image, 2*2 Stitching | Image Retrieval

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Long-Context Understanding | MMNeedle | 1 Image, 2*2 Stitching, Exact Accuracy | 18.9 | IDEFICS2-8B |
| Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 7.8 | IDEFICS2-8B |
| Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 0.9 | IDEFICS2-8B |
| MMR total | MRR-Benchmark | Total Column Score | 256 | Idefics-2-8B |

Related Papers

FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval (2025-07-09)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval (2025-07-08)
An analysis of vision-language models for fabric retrieval (2025-07-07)