Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, which jointly models visual, text, and audio resources. OPT is constructed in an encoder-decoder framework, including three single-modal encoders that generate token-based embeddings for each modality, a cross-modal encoder that encodes the correlations among the three modalities, and two cross-modal decoders that generate text and images, respectively. For OPT's pre-training, we design a multi-task pretext learning scheme that models multi-modal resources at three data granularities, i.e., token-, modality-, and sample-level, through which OPT learns to align and translate among the different modalities. Pre-training is carried out on a large number of image-text-audio triplets from Open Images. Experimental results show that OPT learns strong image-text-audio multi-modal representations and achieves promising results on a variety of cross-modal understanding and generation tasks.
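To make the described layout concrete, below is a minimal sketch of the encoder-decoder structure the abstract outlines: three single-modal encoders, one cross-modal encoder, and two cross-modal decoders. All module names, layer counts, hidden sizes, and input feature dimensions are assumptions chosen for illustration; they are not taken from the paper and do not reproduce OPT's actual implementation or pre-training objectives.

```python
# Illustrative sketch only: layer sizes, token counts, and module names
# (ModalityEncoder, OPTSketch) are placeholders, not the paper's design.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Single-modal encoder: maps pre-extracted features to token embeddings."""
    def __init__(self, in_dim, d_model=768, num_layers=2, nhead=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                  # x: (batch, tokens, in_dim)
        return self.encoder(self.proj(x))  # (batch, tokens, d_model)


class OPTSketch(nn.Module):
    """Three single-modal encoders, one cross-modal encoder, two decoders."""
    def __init__(self, d_model=768, nhead=8):
        super().__init__()
        # Single-modal encoders (input feature dims are placeholders).
        self.vision_enc = ModalityEncoder(in_dim=2048, d_model=d_model)
        self.text_enc = ModalityEncoder(in_dim=300, d_model=d_model)
        self.audio_enc = ModalityEncoder(in_dim=128, d_model=d_model)
        # Cross-modal encoder over the concatenated token sequences.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.cross_enc = nn.TransformerEncoder(layer, num_layers=4)
        # Cross-modal decoders for text and image generation.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.text_dec = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.image_dec = nn.TransformerDecoder(dec_layer, num_layers=2)

    def forward(self, vision, text, audio, text_tgt, image_tgt):
        # Concatenate the per-modality token embeddings into one sequence.
        tokens = torch.cat(
            [self.vision_enc(vision), self.text_enc(text), self.audio_enc(audio)],
            dim=1,
        )
        memory = self.cross_enc(tokens)  # joint multi-modal representation
        return self.text_dec(text_tgt, memory), self.image_dec(image_tgt, memory)


# Usage with random placeholder features.
model = OPTSketch()
v = torch.randn(2, 36, 2048)    # e.g. image region features
t = torch.randn(2, 20, 300)     # e.g. word embeddings
a = torch.randn(2, 50, 128)     # e.g. audio frame features
txt_q = torch.randn(2, 20, 768)  # decoder query tokens for text generation
img_q = torch.randn(2, 64, 768)  # decoder query tokens for image generation
text_out, image_out = model(v, t, a, txt_q, img_q)
```

The token-, modality-, and sample-level pretext objectives would be defined on top of these outputs (e.g. masked-token prediction on the encoder side and reconstruction losses on the decoder side), but the abstract does not specify their exact form, so they are omitted here.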
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-to-Image Retrieval | Localized Narratives | Text-to-image R@1 | 0.4196 | OPT |
| Text-to-Image Retrieval | Localized Narratives | Text-to-image R@5 | 0.72 | OPT |
| Text-to-Image Retrieval | Localized Narratives | Text-to-image R@10 | 0.8126 | OPT |
| Text-to-Audio Retrieval | Localized Narratives | Text-to-audio R@1 | 0.78 | OPT |
| Text-to-Audio Retrieval | Localized Narratives | Text-to-audio R@5 | 0.927 | OPT |
| Text-to-Audio Retrieval | Localized Narratives | Text-to-audio R@10 | 0.958 | OPT |