Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FLAVA: A Foundational Language And Vision Alignment Model

Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

2021-12-08 · CVPR 2022
Tasks: Zero-shot Text-to-Image Retrieval, Visual Reasoning, Image-to-Text Retrieval, Zero-shot Text Retrieval, Zero-shot Image Retrieval, Image Retrieval
Links: Paper · PDF · Code

Abstract

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
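The abstract distinguishes cross-modal (contrastive) models from multi-modal (early-fusion) models. As a rough, hypothetical sketch of the contrastive half, the snippet below implements a CLIP-style symmetric InfoNCE objective over L2-normalized image and text embeddings; this is an illustration of the general technique, not FLAVA's exact training loss, and all function names here are made up for the example.

```python
import numpy as np

def contrastive_logits(image_emb, text_emb, temperature=0.07):
    """Cosine-similarity logits between L2-normalized image and text embeddings."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img @ txt.T / temperature

def infonce_loss(logits):
    """Symmetric cross-entropy; matched image-text pairs lie on the diagonal."""
    n = logits.shape[0]
    # Log-softmax over rows (image -> text) and columns (text -> image).
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    idx = np.arange(n)
    loss_i2t = -log_sm_rows[idx, idx].mean()
    loss_t2i = -log_sm_cols[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2
```

In this setup the loss is minimized when each image's embedding is closest to its own caption's embedding, which is exactly the alignment property that zero-shot retrieval exploits.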

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval | COCO (Common Objects in Context) | recall@1 | 38.38 | FLAVA (zero-shot) |
| Image Retrieval | COCO (Common Objects in Context) | recall@5 | 67.47 | FLAVA (zero-shot) |
| Image Retrieval | COCO (Common Objects in Context) | recall@1 | 33.29 | CLIP (zero-shot) |
| Image Retrieval | COCO (Common Objects in Context) | recall@5 | 62.47 | CLIP (zero-shot) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 42.74 | FLAVA (ViT-B, zero-shot) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 76.76 | FLAVA (ViT-B, zero-shot) |
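The recall@K values above measure the fraction of queries whose ground-truth match appears among the top K retrieved items. A minimal illustrative implementation (a hypothetical helper, not the evaluation code behind these numbers) for a similarity matrix where query i's true match is item i:

```python
import numpy as np

def recall_at_k(similarity, k):
    """Percentage of queries whose true match (index i for query i) ranks in the top k."""
    n = similarity.shape[0]
    # Sort each row descending and keep the top-k item indices.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == np.arange(n)[:, None]).any(axis=1)
    return hits.mean() * 100
```

For text-to-image retrieval the rows would be caption queries scored against all images; transposing the similarity matrix gives the image-to-text direction.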

Related Papers

- LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
- RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
- PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
- Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
- MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)