Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
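The core idea above, i.e. casting every task as an instruction-to-sequence problem so that no task-specific heads are needed, can be sketched as follows. This is a minimal illustration, not the official OFA code: `Example` and `make_example` are hypothetical helpers, and the instruction templates follow the style described in the paper.

```python
# Sketch of instruction-based unification: every task becomes a text
# instruction (plus an optional image) mapped to a target sequence.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    instruction: str        # the task is expressed purely in the input text
    image: Optional[bytes]  # raw image for cross-modal tasks, else None
    target: str             # caption / answer / region tokens, all as text

def make_example(task: str, image: Optional[bytes] = None,
                 text: str = "", target: str = "") -> Example:
    # One instruction template per task; the decoder simply generates the
    # target sequence, so no task-specific output layer is required.
    templates = {
        "caption": "what does the image describe?",
        "vqa": text,  # the question itself is the instruction
        "grounding": f'which region does the text "{text}" describe?',
    }
    return Example(instruction=templates[task], image=image, target=target)

ex = make_example("caption", image=b"<jpeg bytes>", target="two dogs on grass")
print(ex.instruction)  # -> what does the image describe?
```

Because both cross-modal and unimodal tasks share this single input/output interface, pretraining and finetuning can use the same sequence-to-sequence objective throughout.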
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 82 | OFA |
| Visual Question Answering (VQA) | GRIT | VQA (ablation) | 72.4 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (number) | 71.44 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (other) | 73.35 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (overall) | 81.98 | OFA |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (yes/no) | 94.66 | OFA |
| Natural Language Inference | SNLI-VE val | Accuracy | 91 | OFA |
| Natural Language Inference | SNLI-VE test | Accuracy | 91.2 | OFA |
| Image Captioning | COCO Captions | BLEU-4 | 44.9 | OFA |
| Image Captioning | COCO Captions | CIDEr | 154.9 | OFA |
| Image Captioning | COCO Captions | METEOR | 32.5 | OFA |
| Image Captioning | COCO Captions | SPICE | 26.6 | OFA |
| Text Summarization | GigaWord | ROUGE-1 | 39.81 | OFA |
| Text Summarization | GigaWord | ROUGE-2 | 20.66 | OFA |
| Text Summarization | GigaWord | ROUGE-L | 37.11 | OFA |
| Object Categorization | GRIT | Categorization (ablation) | 22.6 | OFA_Large |