Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou
In this work, we explore a scalable way to build a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows new modalities to be added easily by attaching adapters and FFNs, while the shared self-attention layers enable multi-modal fusion. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic spaces of different modalities while capturing fine-grained details within each modality. With its scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
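The adapter / shared-attention / modality-FFN layout described above can be illustrated with a toy sketch. This is not the actual implementation (see the linked repository for that); all class names and dimensions here are hypothetical, and the point is only the routing: every modality passes through the same self-attention weights, while each modality keeps its own FFN, so adding a modality means adding an adapter and an FFN without touching the shared layers.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size; the real 4B model is far larger

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """Single-head self-attention shared by all modalities (toy)."""
    def __init__(self, d):
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    def __call__(self, x):  # x: (seq, d)
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return attn @ v

class ModalityFFN:
    """Modality-specific feed-forward network (ReLU MLP, toy)."""
    def __init__(self, d):
        self.W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
        self.W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

    def __call__(self, x):
        return np.maximum(x @ self.W1, 0.0) @ self.W2

class OnePeaceBlock:
    """One transformer block: shared attention + per-modality FFNs."""
    def __init__(self, d, modalities):
        self.attn = SharedAttention(d)
        self.ffns = {m: ModalityFFN(d) for m in modalities}

    def __call__(self, x, modality):
        x = x + self.attn(x)            # shared path: cross-modal fusion
        x = x + self.ffns[modality](x)  # modality-specific path
        return x

# Extending to a new modality = registering one more FFN (plus an adapter
# upstream, elided here); the attention weights are reused unchanged.
block = OnePeaceBlock(D, ["vision", "audio", "language"])
tokens = rng.standard_normal((5, D))  # stand-in for adapter outputs
out_v = block(tokens, "vision")
out_a = block(tokens, "audio")
print(out_v.shape)  # (5, 8)
```

The same input routed through different modality FFNs yields different outputs, while the attention computation (and hence the fusion space) is identical for both.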
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 88.1 | ONE-PEACE |
| Video | Kinetics-400 | Acc@5 | 97.8 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 82.6 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (number) | 72.24 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (other) | 74.15 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (overall) | 82.52 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (yes/no) | 94.85 | ONE-PEACE |
| Semantic Segmentation | ADE20K | Params (M) | 1500 | ONE-PEACE |
| Semantic Segmentation | ADE20K | Validation mIoU | 63.0 | ONE-PEACE |
| Audio Classification | FSD50K | mAP | 69.7 | ONE-PEACE |
| Audio Classification | VGGSound | Top 1 Accuracy | 68.2 | ONE-PEACE (Audio-Visual) |
| Audio Classification | VGGSound | Top 1 Accuracy | 59.6 | ONE-PEACE (Audio-Only) |
| Image-to-Text Retrieval | Flickr30k | Recall@1 | 97.6 | ONE-PEACE (finetuned, w/o ranking) |
| Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | ONE-PEACE (finetuned, w/o ranking) |
| Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | ONE-PEACE (finetuned, w/o ranking) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 84.1 | ONE-PEACE (ViT-G, w/o ranking) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 98.3 | ONE-PEACE (ViT-G, w/o ranking) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 96.3 | ONE-PEACE (ViT-G, w/o ranking) |
| Text to Audio Retrieval | AudioCaps | R@1 | 42.5 | ONE-PEACE |
| Text to Audio Retrieval | AudioCaps | R@10 | 88.4 | ONE-PEACE |
| Text to Audio Retrieval | AudioCaps | R@5 | 77.5 | ONE-PEACE |
| Text to Audio Retrieval | Clotho | R@1 | 22.4 | ONE-PEACE |
| Text to Audio Retrieval | Clotho | R@10 | 62.7 | ONE-PEACE |
| Text to Audio Retrieval | Clotho | R@5 | 49.0 | ONE-PEACE |