Vocal Similarity Benchmark
The Vocal Similarity Benchmark (VocSim) is designed to evaluate the ability of neural audio embeddings to capture acoustic and perceptual similarity in a zero-shot setting, without task-specific fine-tuning. It addresses the challenge of creating audio representations that generalize across diverse sound types, aiming to mirror the flexibility and nuanced sensitivity of biological auditory systems.

The benchmark is built upon a diverse dataset comprising recordings aggregated from 19 distinct sources:

- Human Speech: phones, words, utterances, and non-verbal sounds from multiple languages, including specific blind test subsets from indigenous languages.
- Animal Vocalizations: songbird syllables and calls (zebra finch, Bengalese finch, canary) and giant otter calls.
- Environmental Sounds: everyday environmental noises from ESC-50.

The dataset is curated into these 19 subsets to stress zero-shot generalization along key axes of variability:

- clip duration, spanning very short to longer clips;
- class structure, ranging from a few well-populated classes to thousands of classes with limited samples;
- recording conditions, from clean studio recordings to naturalistic, noisy field recordings;
- natural variability, i.e. differences in voice/animal identity, amplitude, and pacing.

VocSim provides a rigorous, training-free platform for assessing the intrinsic content-based organization of modern audio embeddings, offering a foundation for developing robust, general-purpose audio representations. Evaluation typically uses training-free metrics such as Precision@k and the Cluster Separation Confusion Fraction (CSCF), computed on pairwise distance matrices derived directly from the embeddings.
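As a rough illustration of this evaluation protocol, the sketch below computes a pairwise cosine-distance matrix from a set of embeddings and a simple Precision@k: for each query, the fraction of its k nearest neighbors (excluding itself) that share the query's label. The choice of cosine distance and the exact Precision@k formulation here are assumptions for illustration, not the benchmark's reference implementation; CSCF is omitted because its definition is not spelled out above.

```python
import numpy as np

def pairwise_cosine_distances(emb):
    """emb: (n, d) array of embeddings -> (n, n) cosine-distance matrix."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def precision_at_k(dist, labels, k=5):
    """Mean over queries of the fraction of the k nearest neighbors
    (self excluded) that carry the same label as the query."""
    labels = np.asarray(labels)
    n = dist.shape[0]
    scores = []
    for i in range(n):
        order = np.argsort(dist[i])          # self comes first (distance 0)
        neighbors = [j for j in order if j != i][:k]
        scores.append(np.mean(labels[neighbors] == labels[i]))
    return float(np.mean(scores))

# Toy example: two well-separated classes in 2-D embedding space.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
D = pairwise_cosine_distances(emb)
print(precision_at_k(D, [0, 0, 1, 1], k=1))
```

Because the metric is computed directly on the distance matrix, the same routine applies unchanged to any embedding model under comparison, which is what makes the evaluation training-free.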