Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang

Published: 2022-02-07

Tasks: Self-Supervised Image Classification, Text-to-Image Generation, Text Generation, Visual Grounding, Referring Expression, Image Classification, Object Categorization, Text Summarization, Visual Entailment, Referring Expression Comprehension, Image Captioning, Visual Reasoning, Image Generation, Visual Question Answering (VQA), Language Modelling

Abstract

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task- and modality-specific customization. We propose OFA, a task-agnostic and modality-agnostic framework that supports task comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, and more, in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In contrast to recent state-of-the-art vision-and-language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA can also transfer effectively to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
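The central idea of the abstract can be sketched in a few lines: every task, cross-modal or unimodal, is expressed as a plain-text instruction fed to a single encoder-decoder, so no task-specific output heads are needed. The snippet below is an illustrative sketch, not the official OFA code; the exact template wordings and function names are assumptions for demonstration.

```python
# Illustrative sketch of instruction-based task unification (hypothetical,
# not the official OFA implementation). Each task is reduced to rendering a
# text instruction that one shared seq2seq model consumes; in the real model
# the image is encoded separately into patch embeddings and concatenated
# with the instruction tokens.

INSTRUCTION_TEMPLATES = {
    "image_captioning": "what does the image describe?",
    "vqa": "{question}",
    "visual_grounding": 'which region does the text "{text}" describe?',
    "text_summarization": 'what is the summary of article "{article}"?',
}

def build_instruction(task: str, **fields: str) -> str:
    """Render the instruction for a task from its template.

    The same model and vocabulary handle every task; only the
    instruction string changes, so no task-specific layers are added.
    """
    return INSTRUCTION_TEMPLATES[task].format(**fields)
```

For example, `build_instruction("visual_grounding", text="a red car")` yields `which region does the text "a red car" describe?`, which the decoder answers with location tokens drawn from the same unified vocabulary as ordinary text.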

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 82 | OFA |
| Visual Question Answering (VQA) | GRIT | VQA (ablation) | 72.4 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | number | 71.44 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | other | 73.35 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 81.98 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | yes/no | 94.66 | OFA |
| Natural Language Inference | SNLI-VE val | Accuracy | 91 | OFA |
| Natural Language Inference | SNLI-VE test | Accuracy | 91.2 | OFA |
| Image Captioning | COCO Captions | BLEU-4 | 44.9 | OFA |
| Image Captioning | COCO Captions | CIDEr | 154.9 | OFA |
| Image Captioning | COCO Captions | METEOR | 32.5 | OFA |
| Image Captioning | COCO Captions | SPICE | 26.6 | OFA |
| Text Summarization | GigaWord | ROUGE-1 | 39.81 | OFA |
| Text Summarization | GigaWord | ROUGE-2 | 20.66 | OFA |
| Text Summarization | GigaWord | ROUGE-L | 37.11 | OFA |
| Object Categorization | GRIT | Categorization (ablation) | 22.6 | OFA_Large |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)