Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Cross Modal Retrieval with Querybank Normalisation

Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie

Published: 2021-12-23 · CVPR 2022
Tasks: Cross-Modal Retrieval, Text to Audio Retrieval, Video Retrieval, Metric Learning, Retrieval
Links: Paper · PDF · Code (official)

Abstract

Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem" in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. Differently from prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross modal retrieval models and benchmarks where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.
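The re-normalisation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the Dynamic Inverted Softmax idea, not the authors' implementation (see their released code for that): the function name `qb_norm`, the `beta` value, and the exact hub criterion used here are assumptions for the sketch.

```python
import numpy as np

def qb_norm(query_gallery_sim, bank_gallery_sim, beta=20.0):
    """Sketch of Querybank Normalisation with a Dynamic Inverted Softmax.

    query_gallery_sim: (num_queries, num_gallery) test-query similarities
    bank_gallery_sim:  (num_bank, num_gallery) querybank similarities
    beta: inverse temperature (assumed value; a tunable hyperparameter)
    """
    # Per-gallery-item normaliser computed from the querybank: the
    # denominator of an inverted softmax over bank queries.
    normaliser = np.exp(beta * bank_gallery_sim).sum(axis=0)  # (num_gallery,)

    # "Activated" gallery items: retrieved as top-1 by at least one
    # querybank query. These are the candidate hubs.
    hubs = set(bank_gallery_sim.argmax(axis=1))

    # Inverted-softmax-normalised similarities for every query.
    normalised = np.exp(beta * query_gallery_sim) / normaliser

    # Dynamic part: only re-normalise queries whose top-1 gallery item
    # is a candidate hub; leave all other queries untouched.
    out = query_gallery_sim.copy()
    for i in range(query_gallery_sim.shape[0]):
        if query_gallery_sim[i].argmax() in hubs:
            out[i] = normalised[i]
    return out
```

Note that the sketch needs no retraining and no access to other test queries: the querybank similarities can be precomputed offline from training captions, which is the property the abstract highlights.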

Results

Task                    | Dataset                  | Metric                    | Value | Model
------------------------|--------------------------|---------------------------|-------|-------------------
Video Retrieval         | MSR-VTT-1kA              | text-to-video R@1         | 47.2  | QB-Norm+CLIP2Video
Video Retrieval         | MSR-VTT-1kA              | text-to-video R@5         | 73    | QB-Norm+CLIP2Video
Video Retrieval         | MSR-VTT-1kA              | text-to-video R@10        | 83    | QB-Norm+CLIP2Video
Video Retrieval         | MSR-VTT-1kA              | text-to-video Median Rank | 2     | QB-Norm+CLIP2Video
Video Retrieval         | VATEX                    | text-to-video R@1         | 58.8  | QB-Norm+CLIP2Video
Video Retrieval         | VATEX                    | text-to-video R@10        | 93.8  | QB-Norm+CLIP2Video
Video Retrieval         | DiDeMo                   | text-to-video R@1         | 43.5  | QB-Norm+CLIP4Clip
Video Retrieval         | DiDeMo                   | text-to-video R@5         | 71.4  | QB-Norm+CLIP4Clip
Video Retrieval         | DiDeMo                   | text-to-video R@10        | 80.9  | QB-Norm+CLIP4Clip
Video Retrieval         | DiDeMo                   | text-to-video Median Rank | 2     | QB-Norm+CLIP4Clip
Video Retrieval         | LSMDC                    | text-to-video R@1         | 22.4  | QB-Norm+CLIP4Clip
Video Retrieval         | LSMDC                    | text-to-video R@5         | 40.1  | QB-Norm+CLIP4Clip
Video Retrieval         | LSMDC                    | text-to-video R@10        | 49.5  | QB-Norm+CLIP4Clip
Video Retrieval         | LSMDC                    | text-to-video Median Rank | 11    | QB-Norm+CLIP4Clip
Video Retrieval         | QuerYD                   | text-to-video R@1         | 15.1  | QB-Norm+TT-CE+
Video Retrieval         | MSVD                     | text-to-video R@1         | 48    | QB-Norm+CLIP2Video
Video Retrieval         | MSVD                     | text-to-video R@5         | 77.9  | QB-Norm+CLIP2Video
Video Retrieval         | MSVD                     | text-to-video R@10        | 86.2  | QB-Norm+CLIP2Video
Video Retrieval         | MSVD                     | text-to-video Median Rank | 2     | QB-Norm+CLIP2Video
Metric Learning         | Stanford Online Products | R@1                       | 78.1  | QB-Norm+RDML
Text to Audio Retrieval | AudioCaps                | R@1                       | 23.9  | QB-Norm+CE
