Matthew Baas, Benjamin van Niekerk, Herman Kamper
Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity -- making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods. Code, samples, trained models: https://bshall.github.io/knn-vc
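To make the conversion step concrete, below is a minimal PyTorch sketch of the frame-level nearest-neighbor replacement described in the abstract. The function name, argument shapes, and the cosine-similarity and averaging details are illustrative assumptions rather than the repository's exact API; it presumes the self-supervised features of the source and reference speech (e.g. WavLM frames) have already been extracted.

```python
import torch

def knn_match(source_feats: torch.Tensor,
              reference_feats: torch.Tensor,
              k: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest
    reference frames under cosine similarity (hypothetical sketch).

    source_feats:    (T_src, D) features of the source utterance
    reference_feats: (T_ref, D) pooled features of the reference speech
    """
    # Normalize so that a dot product equals cosine similarity.
    src = source_feats / source_feats.norm(dim=-1, keepdim=True)
    ref = reference_feats / reference_feats.norm(dim=-1, keepdim=True)
    similarity = src @ ref.T                 # (T_src, T_ref)

    # Indices of the k most similar reference frames per source frame.
    _, idx = similarity.topk(k, dim=-1)      # (T_src, k)

    # Average the selected reference frames to form the converted features.
    return reference_feats[idx].mean(dim=1)  # (T_src, D)
```

The converted features would then be passed to the pretrained vocoder to synthesize the output waveform.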
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Voice Conversion | LibriSpeech test-clean | Character Error Rate (CER) | 2.96 | kNN-VC (prematched HiFiGAN) |
| Voice Conversion | LibriSpeech test-clean | Equal Error Rate (EER) | 37.15 | kNN-VC (prematched HiFiGAN) |
| Voice Conversion | LibriSpeech test-clean | Word Error Rate (WER) | 7.36 | kNN-VC (prematched HiFiGAN) |