Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

Published: 2023-05-18

Tasks: Denoising · Self-Supervised Image Classification · Question Answering · Image-text Retrieval · Text to Audio Retrieval · Image Classification · Action Classification · Audio Classification · Text Retrieval · Audio to Text Retrieval · Referring Expression Comprehension · Audio Question Answering · Zero-Shot Environment Sound Classification · Semantic Segmentation · Zero-shot Text-to-Image Retrieval · Image-to-Text Retrieval · Retrieval · Visual Question Answering (VQA) · 1 Image, 2*2 Stitchi · Audio-Visual Question Answering (AVQA)
Paper · PDF · Code (official) · Code

Abstract

In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
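The abstract's core design — per-modality adapters feeding shared self-attention layers, with a separate FFN branch per modality, plus a cross-modal contrastive objective — can be illustrated with a toy sketch. This is not the released OFA-Sys implementation: the dimensions, single-head attention, class names, and the simplified InfoNCE-style loss below are all illustrative assumptions made for compactness (the real model has 4B parameters and multi-head attention).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size; the real model is far larger

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttentionLayer:
    """Sketch of one ONE-PEACE-style layer: the self-attention weights
    are shared across modalities, while each modality owns its FFN.
    Extending to a new modality means adding one FFN entry (plus an
    input adapter, not shown). Single-head and unnormalized for brevity."""
    def __init__(self, d, modalities):
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) / np.sqrt(d)
                                     for _ in range(3))
        # one FFN per modality -> new modality = one new dict entry
        self.ffn = {m: (rng.standard_normal((d, 4 * d)) / np.sqrt(d),
                        rng.standard_normal((4 * d, d)) / np.sqrt(4 * d))
                    for m in modalities}

    def __call__(self, x, modality):
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v  # shared weights
        h = x + attn
        W1, W2 = self.ffn[modality]                         # modality-specific FFN
        return h + np.maximum(h @ W1, 0) @ W2

def cross_modal_contrast(za, zb, tau=0.07):
    """Generic InfoNCE-style loss sketching 'cross-modal aligning contrast':
    paired embeddings from two modalities attract, unpaired ones repel."""
    za = za / np.linalg.norm(za, axis=-1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=-1, keepdims=True)
    logits = za @ zb.T / tau
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    idx = np.arange(len(za))
    return -logp[idx, idx].mean()

layer = SharedAttentionLayer(d, ["vision", "audio", "language"])
img_tokens = rng.standard_normal((5, d))  # e.g. from a vision adapter
aud_tokens = rng.standard_normal((7, d))  # e.g. from an audio adapter
print(layer(img_tokens, "vision").shape)  # (5, 16)
print(layer(aud_tokens, "audio").shape)   # (7, 16)
```

The point of the sketch is the routing: every modality flows through the same attention weights (which is what enables multi-modal fusion), and only the FFN branch is modality-specific (which is what makes adding a modality cheap).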

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Video | Kinetics-400 | Acc@1 | 88.1 | ONE-PEACE
Video | Kinetics-400 | Acc@5 | 97.8 | ONE-PEACE
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 82.6 | ONE-PEACE
Visual Question Answering (VQA) | VQA v2 test-std | number | 72.24 | ONE-PEACE
Visual Question Answering (VQA) | VQA v2 test-std | other | 74.15 | ONE-PEACE
Visual Question Answering (VQA) | VQA v2 test-std | overall | 82.52 | ONE-PEACE
Visual Question Answering (VQA) | VQA v2 test-std | yes/no | 94.85 | ONE-PEACE
Semantic Segmentation | ADE20K | Params (M) | 1500 | ONE-PEACE
Semantic Segmentation | ADE20K | Validation mIoU | 63 | ONE-PEACE
Audio Classification | FSD50K | mAP | 69.7 | ONE-PEACE
Audio Classification | VGGSound | Top-1 Accuracy | 68.2 | ONE-PEACE (Audio-Visual)
Audio Classification | VGGSound | Top-1 Accuracy | 59.6 | ONE-PEACE (Audio-Only)
Image-to-Text Retrieval | Flickr30k | Recall@1 | 97.6 | ONE-PEACE (finetuned, w/o ranking)
Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | ONE-PEACE (finetuned, w/o ranking)
Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | ONE-PEACE (finetuned, w/o ranking)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 84.1 | ONE-PEACE (ViT-G, w/o ranking)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 96.3 | ONE-PEACE (ViT-G, w/o ranking)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 98.3 | ONE-PEACE (ViT-G, w/o ranking)
Text to Audio Retrieval | AudioCaps | R@1 | 42.5 | ONE-PEACE
Text to Audio Retrieval | AudioCaps | R@5 | 77.5 | ONE-PEACE
Text to Audio Retrieval | AudioCaps | R@10 | 88.4 | ONE-PEACE
Text to Audio Retrieval | Clotho | R@1 | 22.4 | ONE-PEACE
Text to Audio Retrieval | Clotho | R@5 | 49 | ONE-PEACE
Text to Audio Retrieval | Clotho | R@10 | 62.7 | ONE-PEACE

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)