Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VocSim

Vocal Similarity Benchmark

Audio · CC BY 4.0 · Introduced 2025-06-10

VocSim (Vocal Similarity Benchmark) is a benchmark designed to evaluate the ability of neural audio embeddings to capture acoustic and perceptual similarity in a zero-shot setting, without task-specific fine-tuning. It addresses the challenge of creating audio representations that generalize across diverse sound types, aiming to mirror the flexibility and nuanced sensitivity of biological auditory systems. The benchmark is built on the diverse VocSim dataset, comprising 125,382 audio clips aggregated from 19 distinct sources. These span Human Speech (phones, words, utterances, and non-verbal sounds from multiple languages, including blind test subsets from indigenous languages), Animal Vocalizations (songbird syllables and calls from zebra finch, Bengalese finch, and canary, plus giant otter calls), and Environmental Sounds (everyday environmental noises from ESC-50). The dataset is curated into these 19 subsets to stress zero-shot generalization along key axes of variability: Duration (very short to longer clips), Class Granularity (a few well-populated classes to thousands of classes with limited samples), Recording Conditions (clean studio to naturalistic, noisy field recordings), and Intra-class Variability (natural differences in voice/animal identity, amplitude, and pacing).
VocSim provides a rigorous, training-free platform for assessing the intrinsic content-based organization of modern audio embeddings, offering a new foundation for developing robust, general-purpose audio representations. Evaluation typically uses training-free metrics like Precision@k and Cluster Separation Confusion Fraction (CSCF) on pairwise distance matrices computed directly from embeddings.
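As a sketch of the training-free evaluation described above, Precision@k can be computed directly from an embedding matrix and class labels: for each clip, take its k nearest neighbours under a pairwise distance (cosine distance is assumed here for illustration; the benchmark may use other metrics) and measure the fraction that share the clip's label. The toy data below is purely illustrative, not from the VocSim dataset.

```python
import numpy as np

def precision_at_k(embeddings, labels, k=5):
    """Training-free Precision@k: for each clip, the fraction of its k
    nearest neighbours (by cosine distance) sharing its class label."""
    # L2-normalise so the dot product gives cosine similarity
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1.0 - X @ X.T                 # pairwise cosine distance matrix
    np.fill_diagonal(dists, np.inf)       # exclude each clip as its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    hits = labels[neighbours] == labels[:, None]
    return hits.mean()

# Toy example: two classes with well-separated embedding directions
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(1.0, 0.05, (10, 8)),    # class 0
                 rng.normal(-1.0, 0.05, (10, 8))])  # class 1
lab = np.array([0] * 10 + [1] * 10)
print(precision_at_k(emb, lab, k=3))  # → 1.0 for this separable toy data
```

Because the metric operates only on the distance matrix, any embedding model can be evaluated without fine-tuning, which is the sense in which the benchmark is training-free.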

Statistics

Papers
0
Benchmarks
0

Links

Homepage

Tasks

Clustering
Zero-Shot Audio Retrieval