TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Estimated Audio-Caption Correspondences Improve Language-B...

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Paul Primus, Florian Schmid, Gerhard Widmer

2024-08-21Text to Audio RetrievalContrastive LearningRetrieval
PaperPDFCode(official)

Abstract

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.

Results

TaskDatasetMetricValueModel
Text to Audio RetrievalClothoR@127.69PaSST-RoBERTa & Estimated Audio–Caption Correspondences
Text to Audio RetrievalClothoR@1070.39PaSST-RoBERTa & Estimated Audio–Caption Correspondences
Text to Audio RetrievalClothoR@557.03PaSST-RoBERTa & Estimated Audio–Caption Correspondences
Text to Audio RetrievalClothomAP@1040.14PaSST-RoBERTa & Estimated Audio–Caption Correspondences

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts2025-07-17HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals2025-07-17Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management2025-07-17SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation2025-07-17From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17A Survey of Context Engineering for Large Language Models2025-07-17MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval2025-07-17Similarity-Guided Diffusion for Contrastive Sequential Recommendation2025-07-16