Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Voice Conversion With Just Nearest Neighbors

Matthew Baas, Benjamin van Niekerk, Herman Kamper

2023-05-30 | Voice Conversion

Abstract

Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity -- making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods. Code, samples, trained models: https://bshall.github.io/knn-vc
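
The matching step described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `knn_convert` is hypothetical, cosine similarity and averaging over `k` neighbors are assumed choices (the abstract only says each frame is replaced by its nearest neighbor), and random arrays stand in for the self-supervised features. Feature extraction and vocoding are left out.

```python
import numpy as np


def knn_convert(source: np.ndarray, reference: np.ndarray, k: int = 1) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest reference frames.

    source:    (n_source_frames, dim) feature matrix
    reference: (n_reference_frames, dim) feature matrix
    Hypothetical helper; the distance metric and averaging are assumptions.
    """
    # Normalize rows so that a dot product equals cosine similarity (assumed metric).
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    similarity = src @ ref.T  # shape: (n_source_frames, n_reference_frames)

    # Indices of the k most similar reference frames for every source frame.
    nearest = np.argpartition(-similarity, k - 1, axis=1)[:, :k]

    # Average the selected (unnormalized) reference frames.
    return reference[nearest].mean(axis=1)


# Random stand-ins for source and reference features (not real speech features).
rng = np.random.default_rng(0)
source_feats = rng.standard_normal((50, 16))
reference_feats = rng.standard_normal((200, 16))
converted = knn_convert(source_feats, reference_feats, k=1)
```

With `k=1` every output frame is copied directly from the reference, so the converted sequence keeps the length of the source but contains only frames from the target speaker. In a full pipeline, a vocoder would then synthesize audio from `converted`.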

Results

Task | Dataset | Metric | Value | Model
Voice Conversion | LibriSpeech test-clean | Character Error Rate (CER) | 2.96 | kNN-VC (prematched HiFiGAN)
Voice Conversion | LibriSpeech test-clean | Equal Error Rate | 37.15 | kNN-VC (prematched HiFiGAN)
Voice Conversion | LibriSpeech test-clean | Word Error Rate (WER) | 7.36 | kNN-VC (prematched HiFiGAN)

Related Papers

RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding (2025-06-12)
Training-Free Voice Conversion with Factorized Optimal Transport (2025-06-11)
CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition (2025-06-06)
Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion (2025-06-04)
StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion (2025-06-03)
Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech (2025-06-02)
SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction (2025-06-02)
LinearVC: Linear transformations of self-supervised features through the lens of voice conversion (2025-06-02)