Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiao-Wei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao

2020-04-13 | ECCV 2020 8
Tasks: Cross-Modal Retrieval, Image-text matching, Image Captioning, Image-to-Text Retrieval, Visual Question Answering (VQA), Image Retrieval

Abstract

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute force manner, in this paper, we propose a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks.
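
The contrast the abstract draws, plain concatenation of text and region features versus an input that also carries detected object tags, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the whitespace tokenizer, and the toy two-dimensional region features are all assumptions made for the example. The only idea taken from the abstract is that object tags detected in the image are added to the input, where they can act as anchor points because they often also appear in the paired text.

```python
from typing import List, Tuple

# A segment is (segment type, payload); the types are illustrative labels only.
Segment = Tuple[str, object]


def tokenize(text: str) -> List[str]:
    """Toy whitespace tokenizer (an assumption; real models use subword tokenizers)."""
    return text.lower().split()


def build_baseline_input(caption: str,
                         region_features: List[List[float]]) -> List[Segment]:
    """Plain concatenation of text tokens and image region features."""
    sequence: List[Segment] = [("text", tok) for tok in tokenize(caption)]
    sequence += [("region", feat) for feat in region_features]
    return sequence


def build_tag_anchored_input(caption: str,
                             object_tags: List[str],
                             region_features: List[List[float]]) -> List[Segment]:
    """Text tokens, then detected object tags, then image region features."""
    sequence: List[Segment] = [("text", tok) for tok in tokenize(caption)]
    sequence += [("tag", tag.lower()) for tag in object_tags]
    sequence += [("region", feat) for feat in region_features]
    return sequence


def shared_anchor_words(caption: str, object_tags: List[str]) -> List[str]:
    """Object tags that are also mentioned in the caption."""
    caption_tokens = set(tokenize(caption))
    return [tag.lower() for tag in object_tags if tag.lower() in caption_tokens]


# Hypothetical example data: one caption, two detected objects, two region vectors.
caption = "A dog sitting on a couch"
object_tags = ["dog", "couch"]
region_features = [[0.1, 0.2], [0.3, 0.4]]

baseline = build_baseline_input(caption, region_features)
anchored = build_tag_anchored_input(caption, object_tags, region_features)
anchors = shared_anchor_words(caption, object_tags)
```

In this toy case both detected tags occur in the caption, which is the situation the abstract describes as common for salient objects.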

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 73.82 | Oscar
Image Captioning | COCO Captions | BLEU-4 | 41.7 | Oscar
Image Captioning | COCO Captions | CIDEr | 140 | Oscar
Image Captioning | COCO Captions | METEOR | 30.6 | Oscar
Image Captioning | COCO Captions | SPICE | 24.5 | Oscar
Image Captioning | nocaps-val-overall | CIDEr | 80.9 | OSCAR
Image Captioning | nocaps-val-overall | SPICE | 11.3 | OSCAR
Image Retrieval | COCO (Common Objects in Context) | Recall@10 | 98.3 | Oscar
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 73.5 | Oscar
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 92.2 | Oscar
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 96 | Oscar
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 57.5 | Oscar
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 82.8 | Oscar
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 89.8 | Oscar
Image Retrieval with Multi-Modal Query | CommercialAdsDataset | ADD(S) AUC | 87.45 | OSCAR
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 73.5 | Oscar
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 92.2 | Oscar
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 96 | Oscar
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 57.5 | Oscar
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 82.8 | Oscar
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 89.8 | Oscar
Cross-Modal Information Retrieval | CommercialAdsDataset | ADD(S) AUC | 87.45 | OSCAR
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 73.5 | Oscar
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 92.2 | Oscar
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 96 | Oscar
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 57.5 | Oscar
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 82.8 | Oscar
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 89.8 | Oscar
Cross-Modal Retrieval | CommercialAdsDataset | ADD(S) AUC | 87.45 | OSCAR
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 99.8 | Oscar
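
The R@K (Recall@K) figures in the retrieval rows are usually read as the percentage of queries for which at least one correct item appears among the top K ranked candidates. A minimal sketch of that common definition follows. The function name, the dictionary-based data layout, and the toy rankings are assumptions for illustration, and whether the values listed above were computed in exactly this way is not stated on this page.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    ranked   maps each query id to its candidate ids, best first.
    relevant maps each query id to the set of correct candidate ids.
    """
    if not ranked:
        raise ValueError("at least one query is required")
    hits = 0
    for query, candidates in ranked.items():
        # A query counts as a hit if any correct item is within the top k.
        if relevant[query] & set(candidates[:k]):
            hits += 1
    return 100.0 * hits / len(ranked)


# Hypothetical toy data: two text queries ranking three images each.
ranked = {
    "caption_1": ["img_a", "img_b", "img_c"],
    "caption_2": ["img_c", "img_a", "img_b"],
}
relevant = {
    "caption_1": {"img_a"},
    "caption_2": {"img_b"},
}

r_at_1 = recall_at_k(ranked, relevant, k=1)  # only caption_1 hits at rank 1
r_at_3 = recall_at_k(ranked, relevant, k=3)  # both queries hit within rank 3
```

Under this definition R@K can only stay the same or grow as K increases, which matches the ordering of the R@1, R@5, and R@10 values in the table.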

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)