Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou
In this work, we explore a scalable way to build a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows new modalities to be added easily by attaching adapters and FFNs, while the shared self-attention layers enable multi-modal fusion. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic spaces of different modalities while capturing fine-grained details within each modality. With its scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
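The adapter / shared-attention / modality-FFN layout described above can be illustrated with a toy sketch. This is not the actual implementation (see the linked repository for that); all class names and dimensions here are hypothetical, and the point is only the routing: every modality passes through the same self-attention weights, while each modality keeps its own FFN, so adding a modality means adding an adapter and an FFN without touching the shared layers.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size; the real 4B model is far larger

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """Single-head self-attention shared by all modalities (toy)."""
    def __init__(self, d):
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    def __call__(self, x):  # x: (seq, d)
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return attn @ v

class ModalityFFN:
    """Modality-specific feed-forward network (ReLU MLP, toy)."""
    def __init__(self, d):
        self.W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
        self.W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

    def __call__(self, x):
        return np.maximum(x @ self.W1, 0.0) @ self.W2

class OnePeaceBlock:
    """One transformer block: shared attention + per-modality FFNs."""
    def __init__(self, d, modalities):
        self.attn = SharedAttention(d)
        self.ffns = {m: ModalityFFN(d) for m in modalities}

    def __call__(self, x, modality):
        x = x + self.attn(x)            # shared path: cross-modal fusion
        x = x + self.ffns[modality](x)  # modality-specific path
        return x

# Extending to a new modality = registering one more FFN (plus an adapter
# upstream, elided here); the attention weights are reused unchanged.
block = OnePeaceBlock(D, ["vision", "audio", "language"])
tokens = rng.standard_normal((5, D))  # stand-in for adapter outputs
out_v = block(tokens, "vision")
out_a = block(tokens, "audio")
print(out_v.shape)  # (5, 8)
```

The same input routed through different modality FFNs yields different outputs, while the attention computation (and hence the fusion space) is identical for both.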
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 88.1 | ONE-PEACE |
| Video | Kinetics-400 | Acc@5 | 97.8 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 82.6 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (number) | 72.24 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (other) | 74.15 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (overall) | 82.52 | ONE-PEACE |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (yes/no) | 94.85 | ONE-PEACE |
| Semantic Segmentation | ADE20K | Params (M) | 1500 | ONE-PEACE |
| Semantic Segmentation | ADE20K | Validation mIoU | 63.0 | ONE-PEACE |
| Audio Classification | FSD50K | mAP | 69.7 | ONE-PEACE |
| Audio Classification | VGGSound | Top 1 Accuracy | 68.2 | ONE-PEACE (Audio-Visual) |
| Audio Classification | VGGSound | Top 1 Accuracy | 59.6 | ONE-PEACE (Audio-Only) |
| Image-to-Text Retrieval | Flickr30k | Recall@1 | 97.6 | ONE-PEACE (finetuned, w/o ranking) |
| Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | ONE-PEACE (finetuned, w/o ranking) |
| Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | ONE-PEACE (finetuned, w/o ranking) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 84.1 | ONE-PEACE (ViT-G, w/o ranking) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 98.3 | ONE-PEACE (ViT-G, w/o ranking) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 96.3 | ONE-PEACE (ViT-G, w/o ranking) |
| Text to Audio Retrieval | AudioCaps | R@1 | 42.5 | ONE-PEACE |
| Text to Audio Retrieval | AudioCaps | R@10 | 88.4 | ONE-PEACE |
| Text to Audio Retrieval | AudioCaps | R@5 | 77.5 | ONE-PEACE |
| Text to Audio Retrieval | Clotho | R@1 | 22.4 | ONE-PEACE |
| Text to Audio Retrieval | Clotho | R@10 | 62.7 | ONE-PEACE |
| Text to Audio Retrieval | Clotho | R@5 | 49.0 | ONE-PEACE |