Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie
Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem", in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. Unlike prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross-modal retrieval models and benchmarks, where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.
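
To make the hubness problem and the idea of querybank re-normalisation concrete, the following is a minimal sketch in plain Python. It is not the released implementation: the function names, the toy similarity values and the inverse temperature `beta` are all assumptions made for illustration. The sketch also assumes that the Dynamic Inverted Softmax re-normalises only those test queries whose top-ranked gallery item is also the top match of at least one querybank query, and leaves all other queries untouched.

```python
import math
from collections import Counter


def argmax(row):
    """Index of the largest value in a list of similarities."""
    return max(range(len(row)), key=row.__getitem__)


def top1_counts(sims):
    """Count how often each gallery index is the top-1 match across queries.

    A gallery item with a very large count behaves like a hub.
    """
    return Counter(argmax(row) for row in sims)


def dynamic_inverted_softmax(test_sims, bank_sims, beta=20.0):
    """Re-normalise test similarities with a querybank (illustrative sketch).

    test_sims: rows are test queries, columns are gallery items.
    bank_sims: rows are querybank queries, columns are the same gallery items.
    beta: inverse temperature (an assumed value, not taken from the source).
    """
    n_gallery = len(bank_sims[0])
    # Gallery items that are the nearest neighbour of at least one bank query.
    activation_set = set(top1_counts(bank_sims))
    # Per-gallery normaliser: summed exponentiated similarity to the querybank.
    norm = [
        sum(math.exp(beta * row[j]) for row in bank_sims)
        for j in range(n_gallery)
    ]
    out = []
    for row in test_sims:
        if argmax(row) in activation_set:
            # Top match is a suspected hub: divide by the querybank normaliser.
            out.append([math.exp(beta * s) / norm[j] for j, s in enumerate(row)])
        else:
            # Top match is not a suspected hub: keep the raw similarities.
            out.append(list(row))
    return out


# Toy data: gallery item 0 is the top match of every querybank query (a hub).
bank_sims = [[0.90, 0.20, 0.10], [0.80, 0.30, 0.20], [0.85, 0.10, 0.40]]
test_sims = [[0.70, 0.65, 0.10], [0.20, 0.30, 0.90]]

hub_counts = top1_counts(bank_sims)
normalised = dynamic_inverted_softmax(test_sims, bank_sims)
```

In this toy setting the first test query initially ranks the hub (item 0) first, and after normalisation item 1 moves to the top because the hub's score is divided by its large querybank normaliser. The second test query's top match is not in the activation set, so its similarities are returned unchanged. Note that the querybank here is separate from the test queries, in line with the claim that no concurrent access to test set queries is needed.
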
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Metric Learning | Stanford Online Products | R@1 | 78.1 | QB-Norm+RDML |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | QB-Norm+CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 47.2 | QB-Norm+CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 83 | QB-Norm+CLIP2Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 73 | QB-Norm+CLIP2Video |
| Video Retrieval | VATEX | text-to-video R@1 | 58.8 | QB-Norm+CLIP2Video |
| Video Retrieval | VATEX | text-to-video R@10 | 93.8 | QB-Norm+CLIP2Video |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 2 | QB-Norm+CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@1 | 43.5 | QB-Norm+CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@10 | 80.9 | QB-Norm+CLIP4Clip |
| Video Retrieval | DiDeMo | text-to-video R@5 | 71.4 | QB-Norm+CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video Median Rank | 11 | QB-Norm+CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@1 | 22.4 | QB-Norm+CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@10 | 49.5 | QB-Norm+CLIP4Clip |
| Video Retrieval | LSMDC | text-to-video R@5 | 40.1 | QB-Norm+CLIP4Clip |
| Video Retrieval | QuerYD | text-to-video R@1 | 15.1 | QB-Norm+TT-CE+ |
| Video Retrieval | MSVD | text-to-video Median Rank | 2 | QB-Norm+CLIP2Video |
| Video Retrieval | MSVD | text-to-video R@1 | 48 | QB-Norm+CLIP2Video |
| Video Retrieval | MSVD | text-to-video R@10 | 86.2 | QB-Norm+CLIP2Video |
| Video Retrieval | MSVD | text-to-video R@5 | 77.9 | QB-Norm+CLIP2Video |
| Text to Audio Retrieval | AudioCaps | R@1 | 23.9 | QB-Norm+CE |
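
The table reports R@K (recall at rank K, as a percentage; higher is better) and Median Rank (lower is better). As a rough sketch of how such metrics are commonly computed from a query-by-gallery similarity matrix, assuming exactly one correct gallery item per query and ties resolved in favour of the target, one could write the following. The function names and toy values are hypothetical and do not come from the benchmarks' evaluation code.

```python
from statistics import median


def rank_of_target(row, target):
    """1-based rank of the target item; ties are resolved in its favour."""
    target_score = row[target]
    return 1 + sum(1 for s in row if s > target_score)


def retrieval_metrics(sims, targets, ks=(1, 5, 10)):
    """Compute R@K (in percent) and median rank for a similarity matrix.

    sims: rows are queries, columns are gallery items.
    targets: index of the correct gallery item for each query.
    """
    ranks = [rank_of_target(row, t) for row, t in zip(sims, targets)]
    metrics = {
        f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks
    }
    metrics["MedR"] = median(ranks)
    return metrics


# Toy example: four queries, four gallery items, query i matches item i.
sims = [
    [0.9, 0.1, 0.2, 0.3],  # target ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # target ranked 2nd
    [0.1, 0.2, 0.9, 0.3],  # target ranked 1st
    [0.9, 0.8, 0.7, 0.1],  # target ranked 4th
]
metrics = retrieval_metrics(sims, targets=[0, 1, 2, 3], ks=(1, 2))
```

With these toy values the target ranks are 1, 2, 1 and 4, so half of the queries retrieve their target at rank 1 and three quarters within the top 2.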