Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search

Yuanmin Tang, Jing Yu, Keke Gai, Yujing Wang, Yue Hu, Gang Xiong, Qi Wu

2023-09-28 · Cross-Modal Retrieval · Image-Text Matching · Image to Text · Cross-Modal Alignment · Natural Language Queries

Paper · PDF · Code (official)

Abstract

Cross-modal sponsored search displays multi-modal advertisements (ads) when consumers search for desired products with natural-language queries in search engines. Since multi-modal ads provide complementary details for query-ads matching, the ability to align ads-specific information across images and texts is crucial for accurate and flexible sponsored search. Conventional research mainly models the implicit correlations between images and texts for query-ads matching, ignoring the alignment of detailed product information and thus yielding suboptimal search performance. In this work, we propose a simple alignment network that explicitly maps fine-grained visual parts in ads images to the corresponding text, leveraging the co-occurrence structure consistency between the vision and language spaces without requiring expensive labeled training data. Moreover, we propose a novel model for cross-modal sponsored search that conducts cross-modal alignment and query-ads matching in two separate processes. In this way, the model matches the multi-modal input in the same language space, achieving superior performance with only half of the training data. Our model outperforms state-of-the-art models by 2.57% on a large commercial dataset. Beyond sponsored search, our alignment method is applicable to general cross-modal search: on a typical cross-modal retrieval task on the MSCOCO dataset, it achieves consistent performance improvements, demonstrating the generalization ability of our method. Our code is available at https://github.com/Pter61/AlignCMSS/
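The abstract describes a two-stage design: first explicitly align fine-grained visual parts of the ad image into the text embedding space, then perform query-ads matching entirely in that shared language space. Below is a minimal, hypothetical PyTorch sketch of that separation, not the authors' released implementation (see the linked repository for that); the module names, feature dimensions, and the simple mean-pooling fusion are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualToTextAligner(nn.Module):
    """Sketch: projects region-level image features into the text embedding space."""

    def __init__(self, vis_dim: int = 2048, txt_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)  # assumed dimensions, not from the paper

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, vis_dim)
        aligned = self.proj(region_feats)        # (batch, num_regions, txt_dim)
        return F.normalize(aligned, dim=-1)


def match_query_to_ad(query_emb: torch.Tensor,
                      ad_text_emb: torch.Tensor,
                      aligned_regions: torch.Tensor) -> torch.Tensor:
    """Sketch: scores a query against an ad whose image regions were aligned to text space."""
    query_emb = F.normalize(query_emb, dim=-1)        # (batch, txt_dim)
    ad_text_emb = F.normalize(ad_text_emb, dim=-1)    # (batch, txt_dim)
    # Fuse the ad's text embedding with its (already text-aligned) region embeddings.
    ad_emb = F.normalize(ad_text_emb + aligned_regions.mean(dim=1), dim=-1)
    return (query_emb * ad_emb).sum(dim=-1)           # cosine similarity per query-ad pair


if __name__ == "__main__":
    aligner = VisualToTextAligner()
    regions = torch.randn(4, 36, 2048)   # dummy region features
    query = torch.randn(4, 768)          # dummy query embeddings
    ad_text = torch.randn(4, 768)        # dummy ad-text embeddings
    scores = match_query_to_ad(query, ad_text, aligner(regions))
    print(scores.shape)                  # torch.Size([4])

Because alignment happens before matching, the matching step only ever compares embeddings that live in the language space, which is the property the abstract credits for reaching strong performance with half of the training data.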

Results

Task                                    | Dataset              | Metric     | Value | Model
Image Retrieval with Multi-Modal Query  | CommercialAdsDataset | ADD(S) AUC | 91.73 | AlignCMSS
Cross-Modal Information Retrieval       | CommercialAdsDataset | ADD(S) AUC | 91.73 | AlignCMSS
Cross-Modal Retrieval                   | CommercialAdsDataset | ADD(S) AUC | 91.73 | AlignCMSS

Related Papers

Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
CATVis: Context-Aware Thought Visualization (2025-07-15)
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models (2025-07-08)
An analysis of vision-language models for fabric retrieval (2025-07-07)
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment (2025-07-03)