Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

Xinsong Zhang, Yan Zeng, Jipeng Zhang, Hang Li

2023-01-12 · Cross-Modal Retrieval · Visual Grounding · Open-Ended Question Answering · Visual Reasoning · Visual Question Answering (VQA)

Paper · PDF · Code (official)

Abstract

Foundation models, or pre-trained models, have substantially improved the performance of various language, vision, and vision-language understanding tasks. However, existing foundation models can only perform best on one type of task: language, vision, or vision-language. It is still an open question whether it is possible to construct a foundation model that performs best on all the understanding tasks, which we call a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. One is to stop gradients from the vision-language training when learning the language encoder. The other is to leverage the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that X-FM can significantly outperform existing general foundation models and perform better than or comparably to existing foundation models designed specifically for language, vision, or vision-language understanding. Code and pre-trained models are released at https://github.com/zhangxinsong-nlp/XFM.
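The abstract's first technique — stopping gradients from the vision-language objective when learning the language encoder — can be illustrated with a toy scalar example. This is a minimal sketch of the stop-gradient idea only, with made-up functions and losses; it is not the authors' implementation, and all names (`language_encoder`, `text_loss`, `fusion_loss`) are hypothetical.

```python
# Toy sketch of stop-gradient training (hypothetical names, scalar "encoders").
# The language-encoder weight w_lang is updated only by the text-only loss;
# the fusion loss still consumes the language encoding but, with stop_grad on,
# contributes no gradient to w_lang. The vision weight w_vis is always guided
# by the fusion loss, mirroring the abstract's second technique.

def language_encoder(w_lang, x):
    return w_lang * x                     # toy "encoding"

def grads(w_lang, w_vis, x_txt, x_img, stop_grad=True):
    h_lang = language_encoder(w_lang, x_txt)
    h_vis = w_vis * x_img
    # gradient of the toy text loss (h_lang - 1)^2 w.r.t. w_lang (chain rule)
    g_lang = 2 * (h_lang - 1.0) * x_txt
    # gradient of the toy fusion loss (h_lang + h_vis - 2)^2 w.r.t. w_lang,
    # zeroed out when the stop-gradient is applied
    g_fusion_lang = 0.0 if stop_grad else 2 * (h_lang + h_vis - 2.0) * x_txt
    # the vision encoder always receives the fusion-loss gradient
    g_vis = 2 * (h_lang + h_vis - 2.0) * x_img
    return g_lang + g_fusion_lang, g_vis
```

In an autograd framework the same effect is typically achieved by detaching the language encoder's output before it enters the fusion encoder, so the multimodal loss cannot propagate back into it.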

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 80.4 | XFM (base)
Visual Reasoning | NLVR2 Dev | Accuracy | 87.6 | XFM (base)
Visual Reasoning | NLVR2 Test | Accuracy | 88.4 | XFM (base)
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 84.2 | XFM (base)
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 96.4 | XFM (base)
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 98.4 | XFM (base)
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 67 | XFM (base)
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 87.2 | XFM (base)
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 92.4 | XFM (base)
Visual Grounding | RefCOCO+ val | Accuracy (%) | 86.1 | XFM (base)
Visual Grounding | RefCOCO+ testA | Accuracy (%) | 90.4 | XFM (base)
Visual Grounding | RefCOCO+ test B | Accuracy (%) | 79.8 | XFM (base)
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 84.2 | XFM (base)
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 96.4 | XFM (base)
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 98.4 | XFM (base)
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 67 | XFM (base)
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 87.2 | XFM (base)
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 92.4 | XFM (base)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 84.2 | XFM (base)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 96.4 | XFM (base)
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 98.4 | XFM (base)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 67 | XFM (base)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 87.2 | XFM (base)
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92.4 | XFM (base)
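The R@K values in the retrieval rows are standard recall-at-K: the fraction of queries whose ground-truth match appears among the K highest-scoring gallery items. A minimal NumPy sketch (hypothetical helper name, assuming the ground-truth match for query i is gallery item i) shows how such a metric is computed from a similarity matrix:

```python
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j] = similarity of query i (e.g. an image) to gallery item j
    # (e.g. a caption); by assumption the correct match for query i is item i.
    ranks = np.argsort(-sim, axis=1)                  # best match first
    # a hit if the correct index appears in the top-k for that query
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()
```

For example, with a 3x3 similarity matrix where two queries rank their match first and one ranks it last, R@1 is 2/3 and R@3 is 1.0, matching the pattern in the table where R@10 is always at least R@5, which is at least R@1.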

Related Papers

LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition (2025-07-15)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation (2025-07-09)