Vocal Similarity Benchmark
The Vocal Similarity Benchmark (VocSim) is designed to evaluate the ability of neural audio embeddings to capture acoustic and perceptual similarity in a zero-shot setting, without task-specific fine-tuning. It addresses the challenge of creating audio representations that generalize across diverse sound types, aiming to mirror the flexibility and nuanced sensitivity of biological auditory systems.

The benchmark is built upon a diverse dataset comprising recordings aggregated from 19 distinct sources:

- Human Speech: phones, words, utterances, and non-verbal sounds from multiple languages, including specific blind test subsets from indigenous languages.
- Animal Vocalizations: songbird syllables and calls (zebra finch, Bengalese finch, canary) and giant otter calls.
- Environmental Sounds: everyday environmental noises from ESC-50.

The dataset is curated into these 19 subsets to stress zero-shot generalization along key axes of variability:

- clip duration, spanning very short to longer clips;
- class structure, ranging from a few well-populated classes to thousands of classes with limited samples;
- recording conditions, from clean studio recordings to naturalistic, noisy field recordings;
- natural variability, i.e. differences in voice/animal identity, amplitude, and pacing.

VocSim provides a rigorous, training-free platform for assessing the intrinsic content-based organization of modern audio embeddings, offering a foundation for developing robust, general-purpose audio representations. Evaluation typically uses training-free metrics such as Precision@k and the Cluster Separation Confusion Fraction (CSCF), computed on pairwise distance matrices derived directly from the embeddings.
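As a rough illustration of this evaluation protocol, the sketch below computes a pairwise cosine-distance matrix from a set of embeddings and a simple Precision@k: for each query, the fraction of its k nearest neighbors (excluding itself) that share the query's label. The choice of cosine distance and the exact Precision@k formulation here are assumptions for illustration, not the benchmark's reference implementation; CSCF is omitted because its definition is not spelled out above.

```python
import numpy as np

def pairwise_cosine_distances(emb):
    """emb: (n, d) array of embeddings -> (n, n) cosine-distance matrix."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def precision_at_k(dist, labels, k=5):
    """Mean over queries of the fraction of the k nearest neighbors
    (self excluded) that carry the same label as the query."""
    labels = np.asarray(labels)
    n = dist.shape[0]
    scores = []
    for i in range(n):
        order = np.argsort(dist[i])          # self comes first (distance 0)
        neighbors = [j for j in order if j != i][:k]
        scores.append(np.mean(labels[neighbors] == labels[i]))
    return float(np.mean(scores))

# Toy example: two well-separated classes in 2-D embedding space.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
D = pairwise_cosine_distances(emb)
print(precision_at_k(D, [0, 0, 1, 1], k=1))
```

Because the metric is computed directly on the distance matrix, the same routine applies unchanged to any embedding model under comparison, which is what makes the evaluation training-free.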