Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OFA

Computer Vision · Introduced 2022 · 32 papers
Source Paper

Description

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task- and modality-specific customization. We propose OFA, a task-agnostic and modality-agnostic framework that supports task comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, and language modeling, in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In contrast to recent state-of-the-art vision-and-language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Code and models are publicly available at https://github.com/OFA-Sys/OFA.
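The key idea above is that every task, cross-modal or unimodal, is cast as a text instruction fed to one sequence-to-sequence model, with no task-specific heads. A minimal sketch of that interface is below; `run_seq2seq` is a hypothetical stand-in for a pretrained encoder-decoder, not the real OFA API, and the instruction phrasings are illustrative assumptions.

```python
# Hypothetical sketch of an instruction-based, task-agnostic seq2seq
# interface in the spirit of OFA. `run_seq2seq` is a placeholder: a
# real model would encode the instruction (plus the image, if given)
# and decode an output token sequence.

def run_seq2seq(instruction: str, image=None) -> str:
    # Placeholder decode step; real output would be generated tokens.
    return f"<decoded answer to: {instruction}>"

# Every task shares the same interface, differing only in the
# instruction text (examples are assumptions, not OFA's exact prompts):
tasks = {
    "captioning":     "what does the image describe?",
    "vqa":            "how many people are in the picture?",
    "grounding":      'which region does the text "a red car" describe?',
    "classification": "what category does the image belong to?",
}

for name, instruction in tasks.items():
    print(f"{name}: {run_seq2seq(instruction, image=None)}")
```

Because the model sees only instruction sequences, transferring to an unseen task amounts to writing a new instruction rather than adding a new output layer.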

Papers Using This Method

- MARCO: Hardware-Aware Neural Architecture Search for Edge Devices with Multi-Agent Reinforcement Learning and Conformal Prediction Filtering (2025-06-16)
- Private MEV Protection RPCs: Benchmark Study (2025-05-26)
- Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation (2025-05-21)
- Learning Object Focused Attention (2025-04-10)
- Efficient Adaptation For Remote Sensing Visual Grounding (2025-03-29)
- Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison (2025-02-20)
- Analysis of the Order Flow Auction under Proposer-Builder Separation (2025-02-17)
- Memory-Optimized Once-For-All Network (2024-09-05)
- Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs (2024-08-08)
- Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge (2024-07-05)
- Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering (2024-06-03)
- The Solution for the CVPR2024 NICE Image Captioning Challenge (2024-04-19)
- ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model (2024-04-03)
- OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining (2023-11-15)
- The Solution for the CVPR2023 NICE Image Captioning Challenge (2023-10-10)
- Lightweight In-Context Tuning for Multimodal Unified Models (2023-10-08)
- One for All: Towards Training One Graph Model for All Classification Tasks (2023-09-29)
- Physics Inspired Hybrid Attention for SAR Target Recognition (2023-09-27)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption (2023-08-16)
- Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models (2023-06-03)